IEEE TRANSACTIONS ON SIGNAL PROCESSING 






Receive Combining vs. Multi-Stream Multiplexing 
in Downlink Systems with Multi-Antenna Users

Emil Björnson, Member, IEEE, Marios Kountouris, Member, IEEE, Mats Bengtsson, Senior Member, IEEE,
and Björn Ottersten, Fellow, IEEE



Abstract — In downlink multi-antenna systems with many users, 
the multiplexing gain is strictly limited by the number of 
transmit antennas N and the use of these antennas. Assuming 
that the total number of receive antennas at the multi-antenna 
users is much larger than N, the optimal multiplexing gain
can be achieved under many different receive strategies. For 
example, the abundance of receive antennas can be utilized to 
schedule users with near-orthogonal channels, for multi-stream 
multiplexing to users with well-conditioned channels, and/or to 
enable interference-aware receive combining. In this paper, we
address the question of whether the N data streams should be
divided among few users (many streams per user) or many 
users (few streams per user, enabling receive combining). Analytic 
results are derived to show how user selection, spatial correlation, 
heterogeneous user conditions, and imperfect channel acquisition 
(quantization or estimation errors) affect the performance when 
sending the maximal number of streams or one stream per 
scheduled user — two extremes in stream allocation. 

While contradicting observations on this topic have been 
reported in prior work, we show that selecting many users and 
allocating one stream per user (i.e., exploiting receive combining) 
is the best candidate under realistic conditions. This is explained 
by the provably stronger resilience towards spatial correlation 
and larger benefit from multi-user diversity. This fundamental 
result has positive implications for the design of downlink systems 
as it reduces the hardware requirements at the users. 

Index Terms — Multi-user MIMO, channel estimation, quan- 
tized feedback, block-diagonalization, zero-forcing, receive com- 
bining. 

I. Introduction 

The performance of downlink wireless communication sys- 
tems can be improved by multi-antenna techniques, which 
enable efficient utilization of spatial dimensions. Depending on 
the available channel state information (CSI), these dimensions 
can be used for enhanced reliability and/or spatial multiplexing 
of multiple data streams with controlled interference [1]. The
downlink single-cell sum capacity (with perfect CSI) behaves
as

$$\min(N, MK)\,\log_2(P) + \mathcal{O}(1) \tag{1}$$

The research leading to these results has received funding from the European
Research Council under the European Community's Seventh Framework
Programme (FP7/2007-2013) / ERC grant agreement number 228044. This
work was presented in part at the IEEE Swedish Communication Technologies
Workshop (Swe-CTW), Stockholm, October 2011.

E. Björnson, M. Bengtsson, and B. Ottersten are with the Signal Processing
Laboratory, ACCESS Linnaeus Center, KTH Royal Institute of
Technology, SE-100 44 Stockholm, Sweden (e-mail: emil.bjornson@ee.kth.se;
mats.bengtsson@ee.kth.se; bjorn.ottersten@ee.kth.se). B. Ottersten is also
with the Interdisciplinary Centre for Security, Reliability and Trust (SnT),
University of Luxembourg, L-1359 Luxembourg-Kirchberg, Luxembourg
(e-mail: bjorn.ottersten@uni.lu). M. Kountouris and E. Björnson are with
SUPELEC (École Supérieure d'Électricité), Gif-sur-Yvette, France (e-mail:
marios.kountouris@supelec.fr).



where N is the number of base station antennas, K is the
number of users, each user has M antennas (M < N), and P
is the signal-to-noise ratio (SNR) defined as the total transmit
power divided by the noise power. The number of users is
typically large (i.e., K > N) and thus the optimal multiplexing
gain is min(N, MK) = N. The multiplexing gain will have
a major impact on the throughput of future cellular networks,
which are expected to increase the cell density and thereby
achieve high SNRs in a power-efficient way [2].

The sum capacity in (1) can in theory be achieved using
dirty-paper coding [3], but this non-linear scheme has impractical
complexity and is very sensitive to CSI imperfections.
Fortunately, the optimal multiplexing gain of N can be
achieved by linear spatial division multiple access (SDMA)
strategies [4], such as block-diagonalization (BD) [5], [6]
and zero-forcing with combining (ZFC) [7], [8]. Such SDMA
strategies transmit N simultaneous data streams, but can divide
them among the users in different ways; the system can select
between $\lceil N/M \rceil$ and N users to be active and allocate from 1
to M streams to each of them. This raises the fundamental 
design question of how the receive antennas at each user 
should be used to maximize the system performance. Inter- 
user interference degrades user performance, while the mutual 
interference between users' own streams can be handled by 
receive processing. Thus, it seems beneficial to only have a 
few active users and multiplex many streams to each of them, 
but one should keep in mind that every additional stream 
allocated to a user experiences a weaker channel gain than 
the previous streams. If fewer than M streams are allocated 
to a user, this user has degrees of freedom for interference- 
aware receive combining to achieve a strong effective channel 
and better spatial co-user compatibility. In other words, it 
is not clear whether receive antennas should be utilized for 
multi-stream multiplexing or receive combining, or perhaps 
something intermediate. The answer has a profound impact on 
wireless system design, including the CSI acquisition protocols 
and receiver architecture. 

A. Related Work 

The answer to the question above can, in principle, be 
obtained by studying which allocation of data streams and 
selection of linear precoding maximizes the sum-rate perfor- 
mance. Unfortunately, this optimization problem is nonconvex 
and combinatorial, thus only suboptimal strategies can be 
applied in practice. Such low-complexity algorithms have been 
proposed in [9]-[12], among others, by successively allocating
data streams to users in a greedy manner. Simulations in
these papers indicate that fewer than N streams should be
used when P and K are small, and that spatial correlation
makes it beneficial to divide the streams among many users.
Simulations in [10] indicate that the probability of allocating
more than one stream per user is small when K grows
large, but they only consider users with homogeneous channel
conditions and all the cited papers assume perfect CSI.

The authors of [8] claim that transmitting at most one
stream per user (i.e., exploiting receive combining) is desirable
when there are many users in the system. They justify their
statement using asymptotic results from [13], where this
approach achieves the sum capacity as $K \to \infty$. But such
argumentation ignores some important issues: 1) asymptotic
optimality can also be proven with multiple streams per user;¹
2) the performance at practical numbers of users is unknown
(K needs to be very large to approach capacity); and 3) the
analysis implies an unbounded asymptotic multi-user diversity
gain, which is a modeling artifact of Rayleigh fading channels.

The authors of [6], [7] arrive at a different conclusion
when they compare BD (which selects N/M users and sends M
streams/user) and ZFC (which selects N users and sends one
stream/user) under quantized CSI. Their numerical analysis
reveals a distinct advantage of BD (i.e., multi-stream multi- 
plexing), but is limited to uncorrelated channels and neither in- 
cludes user selection nor exploits interference rejection. Even 
under these suboptimal system conditions, we show herein 
that their results are misleading; their simulation considers a 
feedback load insufficient for SDMA, meaning that single-user 
transmission greatly outperforms both BD and ZFC. 



B. Main Contributions 

This paper provides a comprehensive answer to how multi- 
antenna users should utilize their antennas in downlink transmissions,
or similarly how many data streams should be
allocated per active user under different system conditions; see
Fig. 1. The main contributions are:

• New analytic results for analyzing the problem under spa- 
tial correlation, user selection, heterogeneous user chan- 
nel conditions, and realistic CSI acquisition. These enable 
asymptotic comparison of the two extremes: allocating 
M streams per active user (asymptotically represented 
by BD) and one stream per active user (asymptotically 
represented by ZFC). We show that ZFC is more re- 
silient to spatial correlation and well adapted to find 
near-orthogonal users, while BD is better at utilizing 
heterogeneous user conditions. Imperfect CSI acquisition 
is shown to have a similar impact on both strategies. 

• Numerical illustrations show that allocating one stream 
per active user is essentially optimal under realistic sys- 
tem conditions, and we explain how other conclusions 
may arise. The main conclusion is therefore that utilizing 

1 The uplink analysis in [14] shows that a non-zero (but bounded) number
of users can use multiple streams, and the well-established uplink-downlink
duality makes this result applicable also in our downlink scenario.




(a) 1 stream per 2-antenna user: ZFC enables receive combining. 




(b) 2 streams per 2-antenna user: BD exploits multi-stream multiplexing. 

Fig. 1. Two ways of dividing four data streams among multi-antenna users,
which also represent two ways of utilizing the receive antennas to reduce
interference. (a) Receive one stream per user and linearly combine the antennas
to achieve an effective channel that rejects interference. (b) Receive multiple
streams and handle their mutual interference through receive processing.

(a) Basic cyclic operation of a block-fading FDD system.
(b) Basic cyclic operation of a block-fading TDD system.
Fig. 2. Basic system operation for FDD and TDD systems.



receive combining is preferable over multi-stream multiplexing.



II. System Model 

We consider a downlink multi-user MIMO system where a 
single base station with N antennas communicates with K > 
N users. Each user has M < N antennas and we will often
assume that N/M is an integer, for analytic convenience. The
narrowband, flat-fading channel to user k is represented in the
complex baseband by $H_k \in \mathbb{C}^{M \times N}$ and the received signal at
this user is

$$y_k = H_k x + n_k \tag{2}$$



where $x \in \mathbb{C}^{N \times 1}$ is the transmitted signal and
$n_k \sim \mathcal{CN}(0, I_M)$ is the (normalized) noise vector. For analytic
convenience and motivated by measurements [16], [17], we
employ the Kronecker model with $H_k = R_{R,k}^{1/2} \tilde{H}_k R_{T,k}^{1/2}$,
where $R_{T,k}$ and $R_{R,k}$ are the positive-definite spatial correlation
matrices at the transmitter and receiver side, respectively,
and $\tilde{H}_k$ has independent $\mathcal{CN}(0, 1)$-entries. We
assume $R_{T,k} = I_N$ (i.e., large antenna separation at the base
station) throughout the analysis, because transmit correlation
both creates complicated mathematical structures and requires
limiting assumptions on the user distribution geometry and
fading environment. Observe that $R_{R,k}$ generally is different
for each user, describing different spatial properties.
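
As an illustration of the Kronecker model above, the following minimal Python sketch draws one channel realization with receive-side correlation only. The helper names `sqrtm_psd` and `kronecker_channel`, the use of NumPy, and the example correlation values are assumptions made for illustration; they are not taken from the analysis in this paper.

```python
import numpy as np

def sqrtm_psd(R):
    """Hermitian square root of a positive semi-definite matrix."""
    w, V = np.linalg.eigh(R)
    w = np.clip(w, 0.0, None)            # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.conj().T

def kronecker_channel(R_rx, R_tx, rng):
    """Draw H = R_rx^{1/2} H_iid R_tx^{1/2} with i.i.d. CN(0,1) entries in H_iid."""
    M, N = R_rx.shape[0], R_tx.shape[0]
    H_iid = (rng.standard_normal((M, N))
             + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    return sqrtm_psd(R_rx) @ H_iid @ sqrtm_psd(R_tx)

# Example (hypothetical values): M = 2 correlated receive antennas,
# N = 4 uncorrelated transmit antennas (R_tx = I_N as assumed in the text).
rng = np.random.default_rng(0)
R_rx = np.array([[1.0, 0.5], [0.5, 1.0]])
H = kronecker_channel(R_rx, np.eye(4), rng)
```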

To analyze the impact of imperfect CSI acquisition, we 
assume block fading where H k is static for a set of channel 
uses, called the coherence time, and then updated indepen- 
dently. We consider both frequency division duplex (FDD) 
and time division duplex (TDD); baselines of the respective 
cyclic system operations are illustrated in Fig. 2.

In FDD systems, users acquire CSI through training signaling
[18] and some of the users feed back quantized CSI.
The base station then performs resource allocation (i.e., stream 
allocation and precoding design) and informs the scheduled 
users of their precoding through a second training stage. Data 
transmission follows until the end of the coherence time, when 
the cycle in Fig. 2(a) restarts.



In TDD systems, the system toggles between uplink and
downlink transmission on the same channel, which enables
training signaling in both directions. We assume perfect
channel reciprocity² and that the coherence time makes CSI
obtained in one block of Fig. 2(b) reliable until we return to
this block in the next cycle. The base station makes resource
allocation decisions for both uplink and downlink, and it
informs the users through training and control signaling.

We assume that all training signals sent in the downlink
direction provide the users with perfect CSI, while CSI
feedback (in FDD) and uplink training (in TDD) might lead to
imperfect CSI at the base station. This assumption simplifies
analysis as it enables coherent reception, thus making the
achievable sum rate a reasonable measure.³ Observe that Fig. 2
only shows the main blocks of the system operations, while
many types of control signaling are necessary in practice.

2 The physical channel is always reciprocal, but different transceiver hardware
is typically used in the downlink and the uplink. Thus, careful calibration
is necessary to utilize the reciprocity in practice.

3 Many of the results herein can be extended to include imperfect CSI
at the users in the resource allocation, followed by a second training stage
that provides scheduled users with sufficiently accurate CSI of the precoded
channels to enable coherent reception. See [19] for examples of this in FDD
systems.

A. Linear Precoding: General Problem Formulation

We consider linear precoding and the transmitted signal is

$$x = \sum_{k=1}^{K} W_k s_k \tag{3}$$

where $W_k \in \mathbb{C}^{N \times d_k}$ is the precoding matrix,
$s_k \sim \mathcal{CN}(0, I_{d_k})$ is the data signal, and $d_k$ is the number of
multiplexed data streams to user k. If this user applies a
semi-unitary receive combining matrix $C_k \in \mathbb{C}^{M \times d_k}$ (i.e.,
$C_k^H C_k = I_{d_k}$) and treats inter-user interference as Gaussian
noise, the achievable information rate becomes



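$$g_k(\{W_\ell\}, C_k) = \log_2 \frac{\det\left(I_{d_k} + \sum_{\ell=1}^{K} C_k^H H_k W_\ell W_\ell^H H_k^H C_k\right)}{\det\left(I_{d_k} + \sum_{\ell \neq k} C_k^H H_k W_\ell W_\ell^H H_k^H C_k\right)} \tag{4}$$

where $\{W_\ell\}$ denotes the set of all precoding matrices and $\ell$
is an arbitrary user index. The transmission is limited by an
average power constraint of P, thus

$$\mathbb{E}\{x^H x\} = \sum_{k=1}^{K} \mathrm{tr}\{W_k W_k^H\} \le P. \tag{5}$$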
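
To make the rate expression (4) and the power constraint (5) concrete, the following minimal Python sketch evaluates both for given channel, precoding, and combining matrices. The function names `achievable_rate` and `transmit_power` and the NumPy-based formulation are illustrative assumptions, not part of the original system description.

```python
import numpy as np

def achievable_rate(k, H_k, W_list, C_k):
    """Rate of user k as in (4): log2 of a ratio of two determinants."""
    d_k = C_k.shape[1]
    G = C_k.conj().T @ H_k                      # effective channel, d_k x N

    def received_cov(indices):
        S = np.zeros((d_k, d_k), dtype=complex)
        for l in indices:
            A = G @ W_list[l]
            S += A @ A.conj().T
        return S

    users = range(len(W_list))
    others = [l for l in users if l != k]
    # slogdet returns (sign, natural-log |det|); convert to base 2 at the end
    num = np.linalg.slogdet(np.eye(d_k) + received_cov(users))[1]
    den = np.linalg.slogdet(np.eye(d_k) + received_cov(others))[1]
    return (num - den) / np.log(2)

def transmit_power(W_list):
    """Left-hand side of (5): sum of tr{W_k W_k^H} over all users."""
    return sum(np.trace(W @ W.conj().T).real for W in W_list)
```

For a single user with $H_k = W_k = C_k = I_2$, the sketch returns a rate of 2 bits per channel use and a transmit power of 2, matching a direct evaluation of (4) and (5).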



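Ideally, we would like to select the number of data streams
$d_k$, the precoding matrices $W_k$, and the receive combining
matrices $C_k$ to maximize the sum rate,

$$\begin{aligned}
\underset{\{W_k, C_k, d_k\}}{\text{maximize}} \quad & \sum_{k=1}^{K} g_k(\{W_\ell\}, C_k) \\
\text{subject to} \quad & \sum_{k=1}^{K} \mathrm{tr}\{W_k W_k^H\} \le P, \\
& C_k^H C_k = I_{d_k}, \quad d_k \ge 0 \quad \forall k.
\end{aligned} \tag{6}$$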

Unfortunately, this resource allocation problem is nonconvex
and NP-hard (even for N = 1, see [20]), thus some
simplifications are necessary to obtain practically feasible
solutions. There are iterative algorithms that converge to a
stationary point of (6) (see [21] and references therein), but the
cyclic system operation in this paper requires a non-iterative
approach.

The ever-increasing demand for data traffic pushes toward
dense cellular deployments that achieve high SNRs with
retained power efficiency [2]. It is therefore of primary
importance to limit the inter-user interference, which we handle
by including a constraint on zero interference between active
users (implicitly requiring $\sum_{k=1}^{K} d_k \le N$). Furthermore, we
fix the receive combining matrix at some value $C_k$,
which makes sense from a CSI acquisition perspective as only
the effective channel $C_k^H H_k$ needs to be obtained through
feedback (in FDD) or training signaling (in TDD). As users
are unaware of co-users, it is natural to let $C_k$ contain the $d_k$
strongest (left) singular vectors of $H_k$, known as maximum
ratio combining (MRC). Under imperfect CSI acquisition, $C_k$
can instead be selected to improve the CSI accuracy [7], [8].
We now have a simplified resource allocation problem,

$$\begin{aligned}
\underset{\{W_k, d_k\}}{\text{maximize}} \quad & \sum_{k=1}^{K} \log_2 \det\left(I_{d_k} + C_k^H H_k W_k W_k^H H_k^H C_k\right) \\
\text{subject to} \quad & \sum_{k=1}^{K} \mathrm{tr}\{W_k W_k^H\} \le P, \quad d_k \ge 0 \quad \forall k, \\
& C_k^H H_k W_\ell = 0_{d_k \times d_\ell} \quad \forall k, \ \forall \ell \neq k.
\end{aligned} \tag{7}$$

When the precoding and stream allocation $\{W_k, d_k\}$ have
been determined for (7), the users are informed of the resource
allocation through training and control signaling. This enables
estimation of both the precoded channel $H_k W_k$ and the
second-order interference term $X_k = \sum_{\ell \neq k} H_k W_\ell W_\ell^H H_k^H$,
which are both necessary for coherent reception. As a nice
by-product [8], this information enables user k to replace the
preliminary receive combiner $C_k$ with the rate-maximizing
MMSE receive combiner $C_k^{\mathrm{MMSE}}$ that contains the left singular
vectors of $(I_M + X_k)^{-1} H_k W_k$ [12]. This improves the
information rate by relaxing the interference nulling into a
balancing between signal gain and interference rejection. The
analytic results will be derived using $C_k$, while the MMSE
receive combiner is used in simulations.
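
The MMSE receive combiner described above can be sketched as follows. This is a minimal illustration under the stated definition (left singular vectors of $(I_M + X_k)^{-1} H_k W_k$); the function names `interference_covariance` and `mmse_combiner` are hypothetical and the code is not the simulation code used for the results in this paper.

```python
import numpy as np

def interference_covariance(H_k, W_others):
    """Second-order interference term: sum over co-users of H_k W_l W_l^H H_k^H."""
    M = H_k.shape[0]
    X = np.zeros((M, M), dtype=complex)
    for W in W_others:
        A = H_k @ W
        X += A @ A.conj().T
    return X

def mmse_combiner(H_k, W_k, X_k):
    """Left singular vectors of (I_M + X_k)^{-1} H_k W_k as an M x d_k matrix."""
    M = H_k.shape[0]
    A = np.linalg.solve(np.eye(M) + X_k, H_k @ W_k)   # avoids an explicit inverse
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :W_k.shape[1]]
```

Without co-user interference ($X_k = 0$) and with a single stream, the combiner reduces to the normalized precoded channel $H_k w_k / \|H_k w_k\|$, up to a phase rotation.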

B. Linear Precoding: Two Extremes

For any feasible data stream allocation $\{d_k\}$, the sum rate
maximizing precoding in (7) is achieved by classic SVD-based
precoding (including water-filling for power allocation)
[5]. The optimal solution to (7) can thus be obtained through
exhaustive search over all possible stream allocations, but it
quickly becomes infeasible as N and K increase. Fortunately,
greedy algorithms for finding the scheduling set $\mathcal{S}$ of users
with $d_k > 0$ can perform remarkably close to optimum; see
[9]-[12]. As greedy algorithms are hard to analyze mathematically,
the analytic results in this paper compare two extremes
in the stream allocation: block-diagonalization (BD) [5] and
zero-forcing with combining (ZFC) [7], [8]. More flexible
allocations are considered in the simulations.

Definition 1. (Block-Diagonalization Precoding) Let $\mathcal{S}^{BD}$ be
a scheduling set with at most N/M users. For each user $k \in \mathcal{S}^{BD}$,
we set $d_k = M$ and $W_k = W_k^{BD} \Upsilon_k^{1/2}$, where $W_k^{BD}$ is
a semi-unitary matrix that satisfies $(W_k^{BD})^H W_k^{BD} = I_M$ and
$H_i W_k^{BD} = 0$ for all $i \in \mathcal{S}^{BD} \setminus \{k\}$. The power allocation is
given by the diagonal matrix $\Upsilon_k \succeq 0_M$. The information rate
is

$$g_k^{BD}(P) = \log_2 \det\left(I_M + H_k W_k^{BD} \Upsilon_k (W_k^{BD})^H H_k^H\right). \tag{8}$$
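
One common way to construct precoders that satisfy Definition 1 is to project onto the null space of the co-user channels. The Python sketch below follows that approach under the assumption that the stacked co-user channels leave a null space of dimension at least M; the function names `bd_precoders` and `bd_rate` are hypothetical, and equal power allocation is assumed in `bd_rate` for simplicity.

```python
import numpy as np

def bd_precoders(H_list):
    """Semi-unitary BD precoders: W_k lies in the null space of all co-user channels."""
    W = []
    for k, H_k in enumerate(H_list):
        M, N = H_k.shape
        others = [H for i, H in enumerate(H_list) if i != k]
        if others:
            H_int = np.vstack(others)
            _, s, Vh = np.linalg.svd(H_int, full_matrices=True)
            rank = int(np.sum(s > 1e-10 * s[0]))
            null_basis = Vh[rank:].conj().T      # N x (N - rank), orthonormal columns
        else:
            null_basis = np.eye(N, dtype=complex)
        # Keep the M strongest directions of H_k inside the null space
        _, _, Vh_eff = np.linalg.svd(H_k @ null_basis, full_matrices=False)
        W.append(null_basis @ Vh_eff[:M].conj().T)   # N x M
    return W

def bd_rate(H_k, W_k, power_per_stream):
    """Rate (8) with equal power allocation (power matrix = power_per_stream * I_M)."""
    A = H_k @ W_k
    M = H_k.shape[0]
    return np.linalg.slogdet(np.eye(M) + power_per_stream * (A @ A.conj().T))[1] / np.log(2)
```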

Definition 2. (Zero-Forcing Precoding with Combining) Each
user combines its antennas using some channel-dependent
unit-norm vector $c_k \in \mathbb{C}^{M \times 1}$. Based on the effective channels
$h_k^H = c_k^H H_k \in \mathbb{C}^{1 \times N}$, a scheduling set $\mathcal{S}^{ZFC}$ with at most
N users is selected. For each user $k \in \mathcal{S}^{ZFC}$, we set $d_k = 1$
and let $W_k = \sqrt{p_k}\, w_k^{ZFC}$, where $w_k^{ZFC}$ is a unit-norm vector
that satisfies $h_\ell^H w_k^{ZFC} = 0$ for all $\ell \in \mathcal{S}^{ZFC} \setminus \{k\}$. The power
$p_k \ge 0$ is allocated to user k and the information rate is

$$g_k^{ZFC}(P) = \log_2\left(1 + p_k\, |h_k^H w_k^{ZFC}|^2\right). \tag{9}$$
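
Definition 2 can be illustrated with the following minimal Python sketch, which uses MRC (the dominant left singular vector) as the receive combiner and a normalized pseudo-inverse for the zero-forcing directions. Both choices, as well as the function names `mrc_combiner`, `zfc_precoders`, and `zfc_rates`, are assumptions made for illustration; other combiners are possible.

```python
import numpy as np

def mrc_combiner(H_k):
    """Dominant left singular vector of H_k (unit norm)."""
    U, _, _ = np.linalg.svd(H_k)
    return U[:, 0]

def zfc_precoders(H_list):
    """Return effective channels (rows h_k^H) and unit-norm zero-forcing vectors."""
    h_eff = np.array([mrc_combiner(H).conj() @ H for H in H_list])   # shape (n_users, N)
    W = np.linalg.pinv(h_eff)                         # column k is orthogonal to h_l^H, l != k
    W = W / np.linalg.norm(W, axis=0, keepdims=True)  # normalize each column
    return h_eff, W

def zfc_rates(h_eff, W, p):
    """Rates (9) for per-user powers p (array of length n_users)."""
    gains = np.abs(np.diag(h_eff @ W)) ** 2
    return np.log2(1.0 + np.asarray(p) * gains)
```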

The sum-rate maximizing power allocations for BD and
ZFC are achieved through water-filling (see [5]), but the
asymptotic analysis in this paper often assumes equal power
allocation (i.e., $\Upsilon_k = \frac{P}{M |\mathcal{S}^{BD}|} I_M$ $\forall k \in \mathcal{S}^{BD}$ and
$p_k = \frac{P}{|\mathcal{S}^{ZFC}|}$ $\forall k \in \mathcal{S}^{ZFC}$) since this becomes optimal in the high-SNR
regime ($P \to \infty$ [22]) where the system is interference-limited
(and the effect of imperfect CSI is more pronounced); this is
practically relevant in dense cellular networks [2]. While the
definitions of BD and ZFC assume perfect CSI, both strategies
can be applied when the transmitter has imperfect CSI by
pretending that the available CSI is perfect [6]-[8].
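
The relation between water-filling and equal power allocation can be checked numerically. The sketch below implements the textbook water-filling rule over parallel channels with positive gains; the function name `water_filling` and the example values are illustrative assumptions.

```python
import numpy as np

def water_filling(gains, P):
    """Maximize sum(log2(1 + p_i * g_i)) subject to sum(p_i) = P, p_i >= 0, g_i > 0."""
    gains = np.asarray(gains, dtype=float)
    order = np.argsort(gains)[::-1]          # strongest channel first
    g = gains[order]
    for n in range(len(g), 0, -1):           # try using the n strongest channels
        mu = (P + np.sum(1.0 / g[:n])) / n   # water level
        p = mu - 1.0 / g[:n]
        if p[-1] >= 0:                       # weakest active channel gets non-negative power
            break
    powers = np.zeros_like(g)
    powers[:n] = p
    out = np.zeros_like(g)
    out[order] = powers                      # undo the sorting
    return out
```

For gains (2, 1) and a very large P, the two allocated powers differ by only 0.5, which is negligible relative to P; this is consistent with equal power allocation becoming optimal in the high-SNR regime.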



Both BD and ZFC precoding are designed to create zero 
inter-user interference, with the difference that ZFC only sends 
one data stream per scheduled user while BD selects fewer 
users but multiplexes M streams to each of them. BD and ZFC 
are identical when each user only has one antenna (i.e., M = 
1). However, this does not mean that BD is a generalization of 
ZFC; there are good reasons for applying ZFC when M > 1: 

1) The base station only needs to acquire the effective
channels $h_k$;

2) The effective channels $h_k$ have better properties than
$H_k$ and can be adapted for interference rejection;

3) User devices require simpler hardware that only decodes 
one stream. 

The interference mitigation is, on the other hand, less 
restrictive under BD since fewer users are involved and the 
mutual interference between streams sent to the same user 
is handled by receive processing [6]. This paper compares the 
performance of ZFC and BD under both perfect and imperfect 
CSI. The results will provide an answer to the fundamental 
question of how to divide the available resources among 
multi-antenna users: should we select many users and enable 
receive combining or select few users and exploit multi-stream
multiplexing?

Remark 1 (Ambiguous Terminology). The terms block-diagonalization
and zero-forcing have been given different
meanings in prior work. Herein, BD refers to the original
work in [5], where each active user receives exactly M data
streams. Apart from the ZFC strategy in Definition 2 (and
in [7], [8]), another downlink zero-forcing strategy for multi-antenna
users was proposed in [23]. In their definition, each
antenna at the multi-antenna users is viewed as a separate
virtual single-antenna user and the zero-forcing idea is applied
to send a separate stream to each antenna with zero inter-antenna
interference. That approach is nothing else than BD
with stricter interference mitigation and can never perform
better than BD. Herein, ZFC means sending one stream per
user and utilizing receive combining, thus ZFC is not a special
case of BD and can hypothetically outperform BD.

III. Comparison of BD and ZFC with perfect CSI 

In this section, we will compare BD and ZFC in the ideal
scenario when both the base station and the users have perfect
CSI. We derive analytic results showing the impact of different
system properties. Under perfect CSI, the achievable sum rate
in (7) asymptotically becomes (as $P \to \infty$) [22]

$$\begin{aligned}
f_\infty^{BD}(P) &= N \log_2\left(\frac{P}{N}\right) + \sum_{k \in \mathcal{S}^{BD}} \log_2 \det\left(H_k W_k^{BD} (W_k^{BD})^H H_k^H\right), \\
f_\infty^{ZFC}(P) &= N \log_2\left(\frac{P}{N}\right) + \sum_{k \in \mathcal{S}^{ZFC}} \log_2\left(|h_k^H w_k^{ZFC}|^2\right),
\end{aligned} \tag{10}$$

for BD and ZFC, respectively. This result is based on having
scheduling sets that satisfy $|\mathcal{S}^{BD}| = N/M$ and $|\mathcal{S}^{ZFC}| = N$ and
on equal power allocation (which is asymptotically optimal).

For both strategies, the asymptotic sum rate behaves as
$\mathcal{M}_\infty \log_2(P) + \mathcal{R}_\infty$, where $\mathcal{M}_\infty$ is the multiplexing gain
and $\mathcal{R}_\infty$ is the rate offset. Both BD and ZFC achieve a
multiplexing gain of $\mathcal{M}_\infty = N$, which is the same high-SNR
slope as that of the sum capacity. We thus need to compare the
rate offsets $\mathcal{R}_\infty$ to conclude which strategy is preferable.

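Theorem 1. Assume that the receive correlation matrices $R_{R,k}$
have eigenvalues $\lambda_{k,M} \ge \cdots \ge \lambda_{k,1} > 0$ and random
user selection with $|\mathcal{S}^{BD}| = N/M$, $|\mathcal{S}^{ZFC}| = N$. The expected
asymptotic difference in sum rate between BD and ZFC is

$$\beta_{\text{BD-ZFC}} = \lim_{P \to \infty} \mathbb{E}\{f_\infty^{BD}(P) - f_\infty^{ZFC}(P)\} \le N \frac{\log_2(e)}{M} \sum_{j=1}^{M-1} \frac{M-j}{j} + \log_2\left(\frac{\prod_{k \in \mathcal{S}^{BD}} \prod_{m=1}^{M} \lambda_{k,m}}{\prod_{i \in \mathcal{S}^{ZFC}} \lambda_{i,M}}\right). \tag{11}$$

Proof: The proof is given in Appendix B.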
The expected asymptotic difference in Theorem 1 consists
of two terms. The first term is positive and is an upper bound
on the expected gain of BD in an uncorrelated system with
homogeneous users (cf. [22, Theorem 3]). The second term
contains a ratio of eigenvalues, belonging to different users.

For users with homogeneous channel conditions where all
$R_{R,k}$ have the same eigenvalues $\lambda_{k,m} = \lambda_m$, the last term in
(11) is

$$N \log_2\left(\frac{\left(\prod_{m=1}^{M} \lambda_m\right)^{1/M}}{\lambda_M}\right) \le 0 \tag{12}$$

which contains the geometric mean of all eigenvalues divided
by the largest eigenvalue. This ratio is smaller than one
(or equal for uncorrelated channels) and thus its logarithm
is negative and approaches $-\infty$ as the eigenvalue spread
increases. In other words, Theorem 1 shows that BD has a
(bounded) advantage on uncorrelated channels, while ZFC becomes
the better choice as the receive-side correlation grows.
The explanation is that BD has less restrictive interference
mitigation, but is more sensitive to poor channels since it uses
all channel dimensions for transmission. Therefore, we can
expect a similar impact of any channel property that increases
the eigenvalue spread in $H_k H_k^H$. This could for example be
spatial correlation at the transmitter-side or a strong (low-rank)
line-of-sight component.
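
The behavior of the bound can be explored numerically. The sketch below assumes that the bound in Theorem 1 has the form $N \frac{\log_2(e)}{M} \sum_{j=1}^{M-1} \frac{M-j}{j}$ plus the logarithm of the eigenvalue ratio (product of all eigenvalues of the BD users divided by the product of the largest eigenvalues of the ZFC users). The function name `theorem1_bound` and the example eigenvalues $1 \pm \rho$ (a two-antenna receiver with correlation magnitude $\rho$) are assumptions for illustration.

```python
import numpy as np

def theorem1_bound(N, M, eig_bd, eig_zfc_max):
    """Upper bound on the expected asymptotic BD-ZFC sum rate difference.

    eig_bd:      array of shape (N/M, M), eigenvalues of the receive correlation
                 matrices of the BD users.
    eig_zfc_max: array of shape (N,), largest eigenvalue for each ZFC user.
    """
    first = N * np.log2(np.e) / M * sum((M - j) / j for j in range(1, M))
    second = np.sum(np.log2(eig_bd)) - np.sum(np.log2(eig_zfc_max))
    return first + second

# Example: N = 8, M = 2, homogeneous users with eigenvalues 1 - rho and 1 + rho
N, M = 8, 2
for rho in (0.0, 0.4, 0.9):
    eig_bd = np.tile([1 - rho, 1 + rho], (N // M, 1))
    eig_zfc = np.full(N, 1 + rho)
    print(rho, theorem1_bound(N, M, eig_bd, eig_zfc))
```

In this example the bound is positive without correlation and turns negative at high correlation, in line with the discussion above.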

To illustrate the opposite effect of having users with different
path losses, we assume for simplicity that there are N/M
strong users with $R_{R,k} = \gamma I_M$, for some $\gamma > 1$, and $N - N/M$
weak users with $R_{R,k} = I_M$. If BD only serves the strong
users while ZFC serves also the weak users, the last term in
(11) becomes

$$N\,\frac{M-1}{M}\,\log_2(\gamma) > 0. \tag{13}$$

This expression approaches $+\infty$ as the difference $\gamma$ between
the strong and weak users grows. In other words, BD is better
at utilizing heterogeneous user channel conditions as it requires
fewer users to be close to the base station to achieve high
sum rates. This benefit will naturally diminish if some fairness
mechanism is introduced to guarantee quality of service to
users with unfavorable path losses.

The expected asymptotic difference in sum rate, $\beta_{\text{BD-ZFC}}$,
can be transferred into a difference $\frac{10 \log_{10}(2)}{N} \beta_{\text{BD-ZFC}}$ [dB] in
transmit power to achieve the same sum rate in the high-SNR
regime [22]. Theorem 1 shows that ZFC requires more power
than BD in uncorrelated scenarios or highly heterogeneous
user conditions, while BD requires more power under spatial
correlation with relatively homogeneous user conditions.

A. Impact of User Selection 

The comparison in Theorem 1 was based on random user
selection of the maximal number of users (N/M with BD and N
with ZFC), although scheduling of spatially separated users is
necessary to achieve the full potential of multi-user MIMO.
This paper assumes K > N users, meaning that only a
subset of users is scheduled at each channel use. If the users
are unevenly distributed in the cell, it could be beneficial
to intentionally schedule fewer users than possible. We will
now analyze how the ability of selecting users with spatially
compatible channels impacts performance.

In the high-SNR regime, the optimal (semi-unitary) precoding
matrix $W_k^{SU}$ for single-user transmission matches the
channel as $C_k^H H_k W_k^{SU} (W_k^{SU})^H = C_k^H H_k$, while the precoding matrix
$W_k \in \mathbb{C}^{N \times d_k}$ of an SDMA scheme is also adapted to the
co-user channels. The expected asymptotic performance loss
of having to cancel inter-user interference is therefore

$$\begin{aligned}
\mathbb{E}\{\text{Loss}\} &= \mathbb{E}\left\{\log_2 \det\left(C_k^H H_k H_k^H C_k\right) - \log_2 \det\left(C_k^H H_k W_k W_k^H H_k^H C_k\right)\right\} \\
&= \mathbb{E}\left\{\log_2 \frac{\det\left(\Lambda_k \Lambda_k^H\right)}{\det\left(\Lambda_k B_k W_k W_k^H B_k^H \Lambda_k^H\right)}\right\} \\
&= -\mathbb{E}\left\{\log_2 \det\left(B_k W_k W_k^H B_k^H\right)\right\}
\end{aligned} \tag{14}$$

where the $d_k \times d_k$ diagonal matrix $\Lambda_k$ contains the non-zero singular values
of $C_k^H H_k$ and $B_k$ contains the corresponding right singular
vectors.⁴ Observe that the eigenvalues of $B_k W_k W_k^H B_k^H$ are
smaller than or equal to one, thus $\mathbb{E}\{\text{Loss}\} \ge 0$. The following
theorem shows how this loss is affected by user selection.
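
The loss expression in (14) can be verified numerically for a single channel realization. The sketch below takes the effective channel $C_k^H H_k$ and a semi-unitary precoder as inputs; the function name `interference_nulling_loss` and the convention that the rows of `B` hold the right singular vectors (as returned by NumPy's reduced SVD) are assumptions for illustration.

```python
import numpy as np

def interference_nulling_loss(G_eff, W_k):
    """Loss -log2 det(B W W^H B^H) for effective channel G_eff (d x N)
    and semi-unitary precoder W_k (N x d)."""
    # Rows of B form an orthonormal basis of the row space of G_eff
    _, _, B = np.linalg.svd(G_eff, full_matrices=False)
    T = B @ W_k
    return -np.linalg.slogdet(T @ T.conj().T)[1] / np.log(2)

rng = np.random.default_rng(3)
G = (rng.standard_normal((2, 6)) + 1j * rng.standard_normal((2, 6))) / np.sqrt(2)

# Single-user precoder matched to the channel: zero loss
_, _, B = np.linalg.svd(G, full_matrices=False)
loss_su = interference_nulling_loss(G, B.conj().T)

# Arbitrary semi-unitary precoder (stand-in for an SDMA precoder): non-negative loss
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)) + 1j * rng.standard_normal((6, 2)))
loss_sdma = interference_nulling_loss(G, Q)
```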

Theorem 2. For any given scheduling sets $\mathcal{S}^{BD}, \mathcal{S}^{ZFC}$ (with
$|\mathcal{S}^{BD}| = N/M$ and $|\mathcal{S}^{ZFC}| = N$), suppose we replace one of the
users in each set with the best one among K random users. If
the best user is the one minimizing the expected asymptotic
loss in (14), these losses for BD and ZFC, respectively, can
be lower bounded as

$$\begin{aligned}
\mathbb{E}\{\text{Loss}^{BD}\} &\ge -M \log_2\left(1 - c_1 K^{-\frac{1}{M(N-M)}}\right), \\
\mathbb{E}\{\text{Loss}^{ZFC}\} &\ge -\log_2\left(1 - c_2 K^{-\frac{1}{N-1}}\right)
\end{aligned} \tag{15}$$

when K is large ($c_1, c_2$ are positive constants, see the proof).

Proof: The proof is given in Appendix C.
This theorem indicates that it is easier to find users with 
near-orthogonal channels under ZFC than under BD, which is 
reasonable since the random channels of BD users occupy M 
dimensions and should happen to be compatible to the other 
users in all these dimensions, while ZFC users only have one 
random dimension (and can adapt it using receive combining). 
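
To visualize how quickly the two lower bounds decay with the number of users, the sketch below evaluates them for a range of K. It assumes the exponents $-1/(M(N-M))$ for BD and $-1/(N-1)$ for ZFC and uses placeholder constants $c_1 = c_2 = 1$; the actual constants follow from the proof in Appendix C and are not reproduced here. The function name `loss_lower_bounds` is hypothetical.

```python
import numpy as np

def loss_lower_bounds(K, N, M, c1=1.0, c2=1.0):
    """Lower bounds on the expected asymptotic losses for BD and ZFC.

    c1 and c2 are placeholder constants (assumption); K must be large enough
    that both logarithm arguments stay positive.
    """
    K = np.asarray(K, dtype=float)
    bd = -M * np.log2(1.0 - c1 * K ** (-1.0 / (M * (N - M))))
    zfc = -np.log2(1.0 - c2 * K ** (-1.0 / (N - 1)))
    return bd, zfc

# Example with N = 8 transmit antennas and M = 2 receive antennas per user
bd, zfc = loss_lower_bounds(np.array([100.0, 10000.0]), N=8, M=2)
```

With these placeholder constants, both bounds decrease with K and the ZFC bound is smaller than the BD bound, which mirrors the interpretation given above.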

4 These matrices can be obtained from a compact singular value decomposition
$C_k^H H_k = U_k \Lambda_k B_k$. Note that $B_k$ contains an orthonormal basis of
the row space of the effective channel $C_k^H H_k$.







[Figure 3 plot: curves labeled Rx-Corr (Bound), Rx-Corr, Tx-Corr, and Both; horizontal axis: Spatial Correlation Factor.]

Fig. 3. The expected asymptotic difference between BD and ZFC in a
system with N = 8 transmit antennas, M = 2 receive antennas per user,
and random user selection. The impact of spatial correlation at the receiving
users, transmitting base station, and both sides is shown (using the exponential
correlation model from [24] with different correlation factors $\rho$).



B. Numerical Illustrations under Perfect CSI 

Next, the analytic properties in Theorem 1 and Theorem 2
are illustrated numerically. To this end, we adopt the simple
exponential correlation model of [24], where $0 \le \rho \le 1$ and

$$[R(\rho)]_{ij} = \begin{cases} (\rho e^{\imath\theta})^{j-i}, & i \le j, \\ (\rho e^{-\imath\theta})^{i-j}, & i > j, \end{cases} \qquad \theta \sim U[0, 2\pi). \tag{16}$$

The magnitude $\rho$ is the correlation factor between adjacent
antennas, where $\rho = 0$ means no spatial correlation and
$\rho = 1$ means full correlation. For simplicity, all users have the
same correlation factor (but $\theta$ is different). It is worth noting
that $\rho$ impacts the perceived spatial correlation non-linearly;
a typical angular spread at a highly spatially correlated
transmitter/receiver is 10-20 degrees which roughly corresponds to
$\rho \approx 0.9$ [25].
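
A correlation matrix following the exponential model can be generated as in the minimal sketch below, which assumes that the entry between antennas i and j is $(\rho e^{\imath\theta})^{j-i}$ for $i \le j$ and its complex conjugate otherwise. The function name `exp_corr` is hypothetical.

```python
import numpy as np

def exp_corr(M, rho, theta):
    """M x M exponential correlation matrix with magnitude rho and phase theta."""
    idx = np.arange(M)
    diff = idx[None, :] - idx[:, None]              # entry (i, j) holds j - i
    return (rho ** np.abs(diff)) * np.exp(1j * theta * diff)

# Example: two receive antennas with rho = 0.5; eigenvalues are 1 - rho and 1 + rho
rng = np.random.default_rng(4)
theta = rng.uniform(0.0, 2.0 * np.pi)
R = exp_corr(2, 0.5, theta)
```

For two antennas the eigenvalues are $1 - \rho$ and $1 + \rho$ regardless of $\theta$, so the eigenvalue spread that enters Theorem 1 grows directly with $\rho$.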

The expected asymptotic difference between BD and ZFC
is shown in Fig. 3 as a function of $\rho$, using N = 8 transmit
antennas and M = 2 receive antennas. This simulation confirms
that BD is advantageous in uncorrelated systems, while
ZFC becomes beneficial as the correlation increases ($\rho > 0.4$
under receive-side correlation, $\rho > 0.7$ under transmit-side
correlation, and $\rho > 0.25$ when both sides are correlated).
The bound in Theorem 1 is only tight at high correlation.

To exemplify the impact of user selection, we use the
capacity-based suboptimal user selection (CBSUS) algorithm
from [26], which greedily adds users sequentially to maximize
the sum rate and might give scheduling sets with fewer than N
data streams. We consider a scenario with N = 8 uncorrelated
transmit antennas and M = 4 receive antennas with correlation
factor $\rho \in \{0, 0.4, 0.8\}$; see [27] for another scenario. We
compare ZFC (1 stream/user) and BD (4 streams/user) with
multi-user eigenmode transmission (MET) from [10] where
data streams are allocated greedily with zero inter-user interference
and users can have different numbers of streams. We
also simulated 2 streams/user, but the result was always in
between ZFC and BD and is therefore not shown. Observe that all
strategies might transmit fewer than N data streams.

Fig. 4. The average achievable sum rate in a system with perfect CSI, N = 8
transmit antennas, M = 4 receive antennas, and the same average SNR among
all users (10 or 20 dB). The performance with different strategies is shown
as a function of the total number of users and for different correlation factors
$\rho$ among the receive antennas.

Fig. 4 shows the average achievable sum rate as a function
of the total number of users K. We consider the case when



all users have the same average SNR (defined as } ), 
either equal to 10 or 20 dB. Irrespective of the SNR, number
of users, and receive-side correlation, ZFC outperforms
BD. Thus, the scheduling benefit of ZFC (from Theorem
2) dominates over the interference mitigation benefit of BD
(from Theorem 1), even for spatially uncorrelated channels.
As expected, the performance with ZFC improves with $\rho$,
while correlation degrades the BD performance. MET has an
advantage over ZFC since it can allocate different numbers of
streams to different users (based on how many singular values
are strong in their channels), but this advantage is small and
disappears asymptotically with the number of users (as also
noted in [10]).

Next, we consider heterogeneous conditions by having 
uniformly distributed users in a circular cell with radius 250 
m (minimal distance is 35 m), a path loss coefficient of 3.5, 
and log-normal shadow-fading with 8 dB standard deviation. 
The average achievable sum rate is shown in Fig. 5 with an 
SNR of 20 dB at the cell edge.⁵ The variable path loss makes 
the results very different from the previous scenario. At low 
receive correlation, BD outperforms ZFC, but the difference 
reduces with K. ZFC is, however, better than BD at high 
correlation and many users. MET has a large advantage over 
the other strategies, explained by its flexible stream allocation. 
To comprehend the difference, the probability that a selected 
user is allocated a certain number of streams is illustrated in 
Fig. 6. Spatial correlation reduces the number of streams per 
user, but the distance-dependence is even more significant; 
cell center users usually receive many streams while cell edge 
users only receive one or a few streams. This is natural since 
cell center users are more likely to have multiple relatively 
strong singular directions. 
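
A user drop of this kind can be sketched as follows. The sketch rests on assumptions that the text does not spell out: the 20 dB cell-edge SNR is taken as the value at distance 250 m before shadowing, users are uniform over the area of the annulus between 35 m and 250 m, and shadowing is independent across users. The function name `drop_users` and its parameters are hypothetical.

```python
import numpy as np

def drop_users(K, radius=250.0, min_dist=35.0, pl_exp=3.5,
               shadow_std_db=8.0, edge_snr_db=20.0, rng=None):
    """Return (distances in m, average SNRs in dB) for K random users."""
    rng = np.random.default_rng() if rng is None else rng
    # Uniform over the annulus area: r^2 is uniform on [min_dist^2, radius^2].
    r = np.sqrt(rng.uniform(min_dist ** 2, radius ** 2, size=K))
    # Path gain relative to the cell edge (0 dB at r = radius).
    path_gain_db = -10.0 * pl_exp * np.log10(r / radius)
    # Log-normal shadow fading, i.e. Gaussian in the dB domain.
    shadow_db = rng.normal(0.0, shadow_std_db, size=K)
    return r, edge_snr_db + path_gain_db + shadow_db
```

With a path loss coefficient of 3.5, a user at the minimal distance of 35 m sees roughly 30 dB more average SNR than a cell-edge user before shadowing, which illustrates why cell center users dominate the stream allocation.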

The conclusion is that ZFC is the method of choice in 
multi-user MIMO systems with perfect CSI and homogeneous 
user conditions (since it performs very close to the more 
complicated MET). On the other hand, MET and BD are 
better under heterogeneous user conditions. It is worth noting 
that the more streams allocated per user, the more channel 
dimensions need to be known at the base station. The next 
section will therefore study how practical CSI acquisition 
affects our results. 

5 Such SNRs are reasonable in dense cellular systems and are necessary to 
compare BD and ZFC in regimes where these are supposed to work well. 

Fig. 5. The average achievable sum rate in a circular cell with perfect CSI, 
N = 8 transmit antennas, M = 4 receive antennas, and an SNR of 20 dB 
at the cell edge. The performance with different strategies is shown as a 
function of the total number of users and for different correlation factors $\rho$ 
among the receive antennas. 

IV. Comparison of BD and ZFC with Imperfect CSI 

In this section, we continue the comparison of BD and 
ZFC by introducing imperfect CSI, originating from either 
quantized feedback in an FDD system or imperfect reverse-link 
estimation in a TDD system. The resources for channel 
acquisition are limited, which has a major impact on both 
the number of channel dimensions that can be acquired per 
user and the accuracy of the acquired CSI. Theoretically, 
users can feed back different numbers of channel dimensions 
depending on some kind of long-term statistical CSI, but that 
would reduce the coverage (by favoring cell center users) and 
require a flexible system operation with additional control 
signaling. We therefore assume that the system acquires d 
dimensions/user from a randomly selected user set, where 
$d \geq 1$ is fixed but depends on the intended precoding strategy. 
This assumption is relaxed in the numerical evaluation. 

A. Comparison with Quantized CSI 

In the FDD system operation of Fig. 2(a), each user selected 
for feedback conveys the d-dimensional subspace spanned by 
its effective channel $\mathbf{C}_k^H \mathbf{H}_k$ using B bits. Similar to [6], 
[28]-[31], we use a codebook $\mathcal{C}_{N,d,B} = \{\mathbf{U}_1, \ldots, \mathbf{U}_{2^B}\}$ with 
codewords $\mathbf{U}_i \in \mathbb{C}^{N \times d}$ from the (complex) Grassmannian 
manifold $\mathcal{G}_{N,d}$; that is, the set of all d-dimensional linear 
subspaces (passing through the origin) in an N-dimensional 
space. Each codeword forms an orthonormal basis and is thus 
a semi-unitary matrix satisfying $\mathbf{U}_i^H \mathbf{U}_i = \mathbf{I}_d$. User k selects 
the codeword that minimizes the chordal distance [32]: 

$$\hat{\mathbf{H}}_k = \arg\min_{\mathbf{U} \in \mathcal{C}_{N,d,B}} \delta(\mathbf{C}_k^H \mathbf{H}_k, \mathbf{U}) \tag{17}$$

where $\delta(\mathbf{B}, \mathbf{U}) = \sqrt{d - \mathrm{tr}\{\mathrm{span}\{\mathbf{B}\}^H \mathbf{U}\mathbf{U}^H \mathrm{span}\{\mathbf{B}\}\}}$ and 
span{·} gives a matrix containing an orthonormal basis of the 
row space. We assume error-free and delay-free feedback, but 
the conclusions of this section are expected to hold true also 
under feedback errors (cf. [19]). 
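
The codeword selection in (17) can be illustrated numerically with a random codebook. The sketch below is an illustration under stated assumptions rather than the simulation code of the paper: the row space of a d x N matrix B is represented by an orthonormal basis of the column span of its conjugate transpose, and random codewords are drawn by orthonormalizing complex Gaussian matrices. All function names are hypothetical.

```python
import numpy as np

def row_space_basis(B):
    """N x d matrix with orthonormal columns spanning the columns of B^H."""
    Q, _ = np.linalg.qr(B.conj().T)
    return Q

def chordal_distance(B, U):
    """sqrt(d - tr{S^H U U^H S}) with S = row_space_basis(B)."""
    S = row_space_basis(B)
    d = S.shape[1]
    val = d - np.linalg.norm(S.conj().T @ U, 'fro') ** 2
    return float(np.sqrt(max(val, 0.0)))   # clip tiny negative rounding errors

def random_codebook(N, d, num_bits, rng):
    """List of 2^num_bits random semi-unitary N x d codewords."""
    book = []
    for _ in range(2 ** num_bits):
        G = rng.standard_normal((N, d)) + 1j * rng.standard_normal((N, d))
        Q, _ = np.linalg.qr(G)
        book.append(Q)
    return book

def select_codeword(H_eff, codebook):
    """Index and distance of the codeword closest to the row space of H_eff."""
    dists = [chordal_distance(H_eff, U) for U in codebook]
    i = int(np.argmin(dists))
    return i, dists[i]
```

The exhaustive search over all codewords is what makes large codebooks computationally demanding, since the number of distance evaluations doubles with every additional feedback bit.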

There is a variety of ways to handle feedback errors 
(especially if the error structure is known), but a simple approach 
is to treat $\hat{\mathbf{H}}_k$ as being the true channel [6] and calculate 
the precoding using a strategy developed for perfect CSI. 
This results in a lower bound on the performance and the 
information rates with BD and ZFC become 



det I 



A I 



log 2 



f £ H fe W? D T,W : 

ees BD 



r*Hf) 
£es BD \{k} 



BD 



log 2 1 



ZFC 1 2 



f E PA 

£es ZFC \{k] 



cfH fc w; 



ZFC 1 2 



(18) 
(19) 



for users in the scheduling sets $\mathcal{S}^{\mathrm{BD}}$ and $\mathcal{S}^{\mathrm{ZFC}}$, respectively. 

Next, we quantify the performance loss for BD and ZFC 
compared with having perfect CSI. Random vector quantization 
(RVQ) is used for analytic convenience (as in [6], 
[7], [33], [34]), meaning that we average over codebooks 
with random codewords from the Grassmannian manifold. As 
any judicious codebook design is better than RVQ, the upper 
bounds on the performance loss that we will derive are valid 
for any reasonable codebook. The following theorem provides 
an upper bound on the performance loss under BD and extends 
results in [6] to include heterogeneous user conditions and 
spatial correlation. 

Theorem 3. Assume that users are scheduled randomly. 
The average rate loss with BD (using equal power allocation) 
for user $k \in \mathcal{S}^{\mathrm{BD}}$ due to RVQ is upper bounded as 

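$$\Delta^{\mathrm{BD\text{-}Q}} = \mathbb{E}\{ g_k^{\mathrm{BD}}(P) - g_k^{\mathrm{BD\text{-}Q}}(P) \} \leq \log_2 \det\Big( \mathbf{I}_M + \frac{P D^{\mathrm{BD}}}{M} \mathbf{R}_{R,k} \Big) \tag{20}$$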
where the average quantization distortion is 



and the average channel gain with QBC is (where ii n = y^) 



M-l m 



A J 



D BD = E{5 2 (H k ,U k )} 

r (m(7V-M)) / 2 B 



M /A7 M(N-M) 



M(N-M) UM(N - M))\ fJ[{M 



(21) 



Proof: The proof is given in Appendix D. ■ 

This theorem will be compared with the corresponding 
result for ZFC, but before stating that theorem we discuss 
how to select the (preliminary) receive combiner $\mathbf{c}_k$. There 
are primarily two factors to consider when selecting $\mathbf{c}_k$: the 
gain of the effective channel $\|\mathbf{c}_k^H \mathbf{H}_k\|_2^2$ and the quantization 
distortion. The results of [35], [36] indicate that the top priority 
in multi-user MIMO systems is to achieve small quantization 
errors. The error can be minimized by the quantization-based 
combining (QBC) approach in [7], where the codeword and 
receive combiner are selected jointly as 



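$$(\mathbf{c}_k^{\mathrm{QBC}}, \hat{\mathbf{h}}_k) = \arg\min_{\mathbf{c}:\, \|\mathbf{c}\|_2 = 1,\; \mathbf{u} \in \mathcal{C}_{N,1,B}} \delta(\mathbf{H}_k^H \mathbf{c}, \mathbf{u}) \tag{22}$$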



The maximum expected SINR combiner (MESC) in [8] 
achieves better practical performance by balancing effective 
channel gain and quantization distortion, but is asymptotically 
equal to QBC at high SNR. Since this is the regime of interest 
herein, we will exploit the analytic simplicity of QBC. Observe 
that QBC and MESC are only used for improved feedback 
accuracy; the MMSE combiner in Section II-A is used to 
maximize the performance during transmission (this was not 
done in the original QBC framework of [7]). 
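
The joint selection of codeword and combiner in QBC can be sketched numerically. The sketch assumes the standard geometric interpretation of QBC: for each unit-norm codeword, the closest direction inside the span of the user's channel is its normalized projection onto that span, so the best codeword is the one with the largest projection norm, and the combiner is then chosen such that the effective channel points in the projected direction. The function name `qbc` and the array layout (codewords stored as rows) are assumptions of this illustration.

```python
import numpy as np

def qbc(H, codebook):
    """Quantization-based combining sketch.

    H: M x N channel matrix. codebook: (L, N) array of unit-norm codewords.
    Returns (unit-norm combiner c, codeword index, squared chordal distance).
    """
    Q, _ = np.linalg.qr(H.conj().T)               # N x M basis of span(H^H)
    # ||Q^H u||^2 for every codeword u (stored as rows of `codebook`).
    proj = np.linalg.norm(codebook.conj() @ Q, axis=1) ** 2
    i = int(np.argmax(proj))                      # smallest distance 1 - proj
    u = codebook[i]
    s = Q @ (Q.conj().T @ u)                      # projection onto span(H^H)
    s = s / np.linalg.norm(s)
    c = np.linalg.pinv(H.conj().T) @ s            # solves H^H c = s
    c = c / np.linalg.norm(c)
    return c, i, float(1.0 - proj[i])
```

Any fixed combiner, such as the dominating left singular vector, restricts the effective channel to a single direction, so its best quantization distance can never be smaller than the one returned here for the same codebook.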

The following theorem provides an upper bound on the 
performance loss under ZFC and extends results in [7] to 
include heterogeneous user conditions and spatial correlation. 

Theorem 4. Assume that $\mathbf{R}_{R,k}$ has eigenvalues $\lambda_{k,M} > \cdots > \lambda_{k,1} > 0$ 
and that N users are selected randomly. The average 
rate loss for ZFC (using equal power allocation and the same 
$\mathbf{c}_k^{\mathrm{QBC}}$) due to RVQ is upper bounded as 



a zfc-q 
where the average quantization distortion is 

Proof: The proof is given in Appendix E. ■ 

The rate loss expressions in Theorem 3 and Theorem 4 for 
BD and ZFC, respectively, indicate the joint impact of spatial 
correlation (at the receiver) and CSI quantization on the 
performance. The main observation is that spatial correlation only 
has a marginal effect on the feedback accuracy; the expressions 
have a similar structure as for uncorrelated channels and the 
same scaling in the number of feedback bits is necessary to 
achieve the optimal multiplexing gain [6], [7]. 

Corollary 1. To achieve the optimal multiplexing gain with BD 
or ZFC under quantized CSI and arbitrary receive correlation, 
it is sufficient to scale the total number of CDI feedback bits 
for the scheduled users as 



B tot ^N(N-M) log 2 (P) + 0(l). 



(26) 



While this corollary only provides a sufficient condition, 
we can expect the scaling law in (26) to also be necessary.⁶ 
It might seem unreasonable that the number of feedback bits 
should approach infinity with the SNR, but if we can achieve 
the optimal multiplexing gain in the downlink it is typically 
achievable also in the uplink [19]. Therefore, the uplink sum 
rate behaves as $N \log_2(P) + \mathcal{O}(1)$ and it is sufficient to allocate 
(approximately) N − M channel uses for CSI feedback. 
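
The arithmetic behind this feedback overhead can be made explicit. The sketch below keeps only the leading terms, $N(N-M)\log_2(P)$ for the total number of feedback bits and $N\log_2(P)$ for the uplink sum rate, and ignores the $\mathcal{O}(1)$ terms; it also assumes that N is divisible by M so that BD serves N/M users. The function names are hypothetical.

```python
import math

def total_cdi_bits(N, M, snr_db):
    """Leading term N (N - M) log2(P) of the total feedback load."""
    P = 10.0 ** (snr_db / 10.0)
    return N * (N - M) * math.log2(P)

def bits_per_user(N, M, snr_db, strategy):
    """Split the total load over N/M users (BD) or N users (ZFC)."""
    users = N // M if strategy == "BD" else N
    return total_cdi_bits(N, M, snr_db) / users

def feedback_channel_uses(N, M, snr_db):
    """Total feedback bits divided by the leading uplink rate N log2(P)."""
    P = 10.0 ** (snr_db / 10.0)
    return total_cdi_bits(N, M, snr_db) / (N * math.log2(P))
```

The ratio of total bits to uplink rate equals N − M independently of the SNR, while the per-user load under BD is M times the per-user load under ZFC.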

Observe that this result is based on random user selection, 
while additional feedback of gain information is necessary 
to achieve multi-user diversity or short-term rate adaptation 
(cf. [38]). As BD requires M times more bits per user, ZFC 
can typically achieve feedback from M times more users. 
We therefore expect ZFC to further strengthen its advantage 
at finding near-orthogonal users (proved in Theorem 2 under 
perfect CSI). In addition, spatial correlation at the transmitter 
side (and other factors that make the channel matrices 
ill-conditioned) will inflict larger performance losses on BD than 
ZFC, just as in the case of perfect CSI. 

B. Comparison under Estimated CSI 

6 The necessary scaling can be proved for ZFC with QBC using a technique 
from [37, Theorem 4], while simulations in [6] show that quantized ZFC and 
BD have the same scaling in the necessary number of bits. 

Next, we assume that the base station acquires CSI through 
imperfect CSI estimation. The primary focus will be on TDD 
systems, where channel estimates are obtained through training 
signaling in the uplink (assuming perfect channel reciprocity). 
But it is worth noting that this approach is similar to having 
analog CSI feedback in FDD systems, where the unquantized 
channel coefficients are sent on an uplink subcarrier [8], [19].⁷ 
The reciprocal uplink counterpart to the system model in 
(2) is 
$$\mathbf{y}_k = \mathbf{H}_k^T \mathbf{x}_k + \mathbf{n}_k \tag{27}$$

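where $\mathbf{y}_k \in \mathbb{C}^{N \times 1}$ is the received uplink signal, $\mathbf{x}_k \in \mathbb{C}^{M \times 1}$ 
is the transmitted uplink signal, and $\mathbf{n}_k \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_N)$ is 
the noise vector.⁸ To estimate $\mathbf{C}_k^H \mathbf{H}_k \in \mathbb{C}^{d \times N}$, user k sends 
$\mathbf{C}_k^* \mathbf{T}_k$ over d uplink channel uses, for some known training 
matrix $\mathbf{T}_k \in \mathbb{C}^{d \times d}$, where $(\cdot)^*$ denotes complex conjugate. 
Assuming perfect statistical CSI, the MMSE estimate of 
$\mathbf{C}_k^H \mathbf{H}_k$ and the corresponding error covariance matrix 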
are iflt 



vec(H^) = ^E,Tf vec(Y fc ), 



TfT fc 

a 2 



(28) 



where = (T^ ® Ijv) and is the received signal from 
training signaling. The training matrix has a total training 
power constraint $\mathrm{tr}\{\mathbf{T}_k^H \mathbf{T}_k\} = \Psi$. 

As under quantized CSI, we calculate the precoding by 
treating $\hat{\mathbf{H}}_k$ as the true channel. This results in a lower bound 
on the performance and the information rates with BD and 
ZFC become 



^ D - EST (^)=log 2 



det(l M + E H fe W? D ^Wf ' H Hf) 
(29) 
(30) 



for users in the scheduling sets $\mathcal{S}^{\mathrm{BD}}$ and $\mathcal{S}^{\mathrm{ZFC}}$, respectively. 
The following theorem provides an upper bound on the 
performance loss under BD due to imperfect CSI estimation. 

Theorem 5. Assume that users are scheduled randomly 
under BD. The average rate loss for user $k \in \mathcal{S}^{\mathrm{BD}}$ (using equal 
power allocation) due to CSI estimation is upper bounded as 

A BD = E{^ D (P)-^ D " EST (P)} 

P(N - M) 



< log 2 det I 



M 



N 



K R T k 



(31) 



7 Digital/quantized feedback might be beneficial over analog/unquantized 
feedback when there are plenty of resources for channel estimation [19]. But 
if very accurate CSI is required, Corollary 1 shows that the quantization 
codebooks grow very large and thus the search for the best codeword might 
be computationally infeasible. 

8 The downlink noise vector was normalized towards the channel matrix in 
the system model. To account for a different noise level at the base station, 
$\sigma^2$ is the (relative) uplink noise variance. 



Proof: The proof is given in Appendix F. ■ 

This theorem will be compared with the corresponding 
result for ZFC, but before stating that theorem we need to 
consider the impact of having MRC as the receive combiner $\mathbf{c}_k$. 
ZFC is similar to applying BD to the effective channels 
$\mathbf{h}_k = \mathbf{c}_k^H \mathbf{H}_k$, but an important difference is that the effective 
channels are not Rayleigh fading (since $\mathbf{c}_k$ depends on the 
current channel realization). The expression in (28) will therefore 
not give the MMSE estimate, but fortunately the linear MMSE 
(LMMSE) estimator is obtained from a similar expression to (28) if we 
know the first two moments of $\mathbf{h}_k$ [18]. 

Lemma 1. Assume that $\mathbf{R}_R$ has eigenvalues $\lambda_M > \ldots > \lambda_1 > 0$, 
where the user indices were dropped for convenience. If $\mathbf{c}$ 
is the dominating left singular vector of $\mathbf{H}$, it holds that 

• the direction of $\mathbf{h} = \mathbf{c}^H \mathbf{H}$ is isotropically 
distributed on the unit sphere; 

• the gain $\|\mathbf{h}\|_2^2$ is independent of the direction and 

In (32), the set of all permutations of {1, . . . , M} is denoted 
$\mathcal{A}_M$. The sign of a given permutation $\zeta = \{\zeta_1, \ldots, \zeta_M\} \in \mathcal{A}_M$ 
is denoted $(-1)^{\mathrm{per}(\zeta)}$, where per(·) is the number of 
inversions⁹ in the permuted sequence. Next, $\mathcal{B}_{l,M}$ is the 
collection of all subsets of $\mathcal{A}_M$ with cardinality $l$ and increasing 
elements (i.e., $\beta_1 < \ldots < \beta_l$ for $\beta = \{\beta_1, \ldots, \beta_l\} \in \mathcal{B}_{l,M}$). 
The upper bound in the summation over t is K\(J5) = 
Yl\=i(N — fa). Finally, is the set of all /-length partitions 



{&!,..., M of £ (i.e., £• 

N - fa: 



(0 



i 



i) that satisfy < h < 



£,0<kj<N- fa Vj 



(34) 



Proof: The proof is given in Appendix G. ■ 

The following theorem provides an upper bound on the 
performance loss under ZFC due to imperfect CSI estimation. 

Theorem 6. Assume that N users are scheduled randomly 
under ZFC and that MRC is applied. The average rate loss 
for user $k \in \mathcal{S}^{\mathrm{ZFC}}$ due to CSI estimation is upper bounded as 

A ZFC-EST = E { 5 f C (P) _ 5 ZFC-EST (j p )} 

P(N ■ 



< log 2 1 
N E{||h fc ||l}- 



(35) 



9 An inversion in a sequence is a pair of numbers that are in incorrect order 
(i.e., not in ascending order). 
Proof: The proof is given in Appendix H. ■ 

The rate loss expressions in Theorem 5 and Theorem 6 
indicate the joint impact of spatial correlation and channel 
estimation on the performance of BD and ZFC, respectively. 
BD is slightly more resilient to CSI uncertainty, since the BD 
expression contains (N − M) where the ZFC expression has 
(N − 1). But observe that the performance losses are calculated 
against the same precoding strategy with perfect CSI; we know 
from Section III that ZFC and BD have different preferable 
user conditions, making it hard to analytically conclude which 
strategy to use under imperfect CSI estimation. However, the 
important result is the following extension of [19] to spatially 
correlated scenarios with M > 1. 

Corollary 2. To achieve the optimal multiplexing gain with BD 
or ZFC under imperfect CSI estimation and arbitrary receive 
correlation, it is sufficient to scale the training power $\Psi$ as 

$$\frac{P}{\Psi} \to \text{constant} < \infty \quad \text{when } P \to \infty. \tag{36}$$

This corollary says that the training power should increase 
linearly with the transmit power to achieve the optimal sum 
rate scaling. This is for example satisfied by setting the total 
training power to $\Psi = P$ under ZFC and $\Psi = MP$ under 
BD, which corresponds to the reasonable assumption of having 
the same average SNR in the downlink and in the uplink.¹⁰ 
The demands for higher CSI accuracy with increasing SNR 
are therefore automatically fulfilled by the reduced estimation 
errors. Observe that one uplink channel use is consumed 
per user antenna dimension that is estimated, thus creating 
a practical bound on how many user channels can be 
estimated in block fading systems [19]. As ZFC only has one 
effective antenna per user, it can accommodate M times more 
users than BD on the same estimation overhead and thereby 
exploit multi-user diversity to a larger extent. 

V. Numerical Illustrations under Imperfect CSI 

This section consists of two parts. First, the numerical 
illustrations in Section III-B are continued under imperfect 
CSI estimation. Then, we analyze the performance behavior 
under quantized CSI. 



A. Continuation of Section III-B under Estimated CSI 



We continue the simulations in Section III-B by introducing 
imperfect CSI estimation. We use MSE-minimizing training 
matrices from [18, Theorem 1] and training power $\Psi = dP$ 
(for estimation of d dimensions/user). The CBSUS algorithm 
in [26] is modified¹¹ to include the average interference (due 
to CSI estimation errors) in the scheduling. 

The average achievable sum rate is shown in Fig. 7 as a 
function of the number of users that we obtain CSI estimates 
for using ZFC (while BD only obtains channel estimates for 
$\frac{1}{M}$ of them). All users have the same average SNR of either 10 
or 20 dB. The performance loss compared with having perfect 
CSI is 10-20% (see Fig. 4), but the conclusion is otherwise 
the same and even clearer than before: ZFC outperforms BD 
in terms of performance with few users, in handling spatial 
correlation, and in exploiting multi-user diversity. 

10 Battery-powered user devices might operate at lower power than the 
base station, but Corollary 2 is satisfied as long as P and $\Psi$ exhibit the 
same scaling. In practical scenarios, the path loss is the main source of SNR 
variations and affects the downlink and uplink equally. 

11 Estimation errors contribute with an average interference of 
$P(|\mathcal{S}| - 1)/|\mathcal{S}| \, \mathbf{E}_{\mathrm{est}}$, where $\mathbf{E}_{\mathrm{est}} = (\mathbf{R}_{R,k}^{-1} + \mathbf{T}_k^H \mathbf{T}_k/\sigma^2)^{-1}$ for BD and $\mathbf{E}_{\mathrm{est}} =$ 

Fig. 7. The average achievable sum rate in a system with CSI estimation 
errors, N = 8 transmit antennas, M = 4 receive antennas, and the same 
average SNR among all users (10 or 20 dB). The performance with different 
strategies is shown as a function of the total number of users and for different 
correlation factors $\rho$ among the receive antennas. 

In case of a circular cell (see Section III-B for details), the 
average achievable sum rate is shown in Fig. 8. Recall from 
Fig. 5 that BD was often better than ZFC in this scenario 
under perfect CSI, but the case is completely different under 
imperfect CSI; ZFC outperforms the other strategies when the 
limited resources for CSI acquisition are taken into account. 
The explanation is that the ZFC benefit of easily finding 
near-orthogonal users (among M times more users than with 
BD) dominates the BD benefit of multi-stream multiplexing 
(preferably to cell center users). We also tested a MET-like 
scheme with greedy stream allocation (we took the optimum 
among feeding back 1, 2 or 4 channel dimensions per active 
user), but it was always identical to ZFC (in both Fig. 7 and 
Fig. 8); this further confirms our conclusion. 

The users selected for feedback were chosen randomly 
(e.g., in a round-robin fashion) in Figs. 7 and 8, but could 
theoretically be based on some kind of long-term statistical 
CSI. This could for instance mean that ZFC acquires one 
dimension from each of the K users, while BD acquires M 
dimensions from the $\frac{K}{M}$ users with the strongest long-term 
statistics $\mathrm{tr}\{\mathbf{R}_{T,k}\}\mathrm{tr}\{\mathbf{R}_{R,k}\}$. The greedy stream allocation 
strategy MET in [10] can be generalized to this scenario by 
finding the K strongest statistical eigendirections among the 
users and acquiring CSI for an equivalent number of dimensions 
per user. Under these assumptions, the average achievable sum 
rate for the circular cell is shown in Fig. 9. The performance 
behavior is quite similar to the case with perfect CSI in Fig. 5: 
BD is better than ZFC, except at high correlation, and there is 
a large gain from greedy stream allocation. However, we stress 
that this scenario is unrealistic as CSI is only acquired for cell 
center users, thus reducing the coverage and user fairness as 
cell edge users are not even selected when their channels are 
relatively strong. 

Fig. 8. The average achievable sum rate in a circular cell with CSI estimation 
errors, N = 8 transmit antennas, M = 4 receive antennas, and an SNR of 
20 dB at the cell edge. The performance with different strategies is shown 
as a function of the total number of users and for different correlation factors 
$\rho$ among the receive antennas. 

Fig. 9. The average achievable sum rate in a circular cell with CSI estimation 
errors, N = 8 transmit antennas, M = 4 receive antennas, and an SNR of 
20 dB at the cell edge. The performance is shown as a function of the total 
number of users K, and CSI is only acquired for the users with strongest 
long-term statistics. 

B. Observations under Quantized CSI 

Next, we consider quantized CSI and let the number of 
feedback bits (per channel dimension) be scaled as 
$(N - M)\log_2(P) - \text{constant}$, where the constant is selected as in 
[6, Eq. (17)] to maintain a 3 dB gap between BD with perfect 
and quantized CSI. We consider N = 4 transmit antennas, 
M = 2 receive antennas, and RVQ. We also modify¹² the 
CBSUS algorithm in [26] to include the average interference 
due to quantization. 

First, we compare BD (having either quantized or perfect 
CSI) with quantized ZFC using MESC-MMSE combining [8] 
and with single-user SVD-based transmission (to a randomly 
selected user). The quantized effective channels are obtained 
from 8 users under ZFC, while the entire channels are 
quantized for 4 users under BD. The average achievable sum rate 
is shown in Fig. 10 as a function of the average SNR. At 
low SNR, quantized BD only selects one user and performs 
similar to single-user transmission. As two data streams are 
transmitted to the selected user, both strategies are slightly 
better than ZFC in this regime. But quantized ZFC quickly 
improves with SNR and becomes the method of choice at 
practical SNRs. The simulation was stopped at P = 14.3 dB 
where BD requires feedback of 22 bits per user, meaning that 
the best codeword is selected in a codebook with over a million 
entries.¹³ BD is therefore suboptimal both in terms of sum rate 
and computational complexity. 

12 Quantization errors contribute with an average interference of 
$P(|\mathcal{S}| - 1)/|\mathcal{S}| \, \mathbf{E}_{\mathrm{quant}}$, where $\mathbf{E}_{\mathrm{quant}} = N/(M(N - M)) D^{\mathrm{BD}} \mathbf{R}_{R,k}$ for BD and 

Fig. 10. The average achievable sum rate with BD and ZFC, quantized CSI 
feedback, N = 4 transmit antennas, M = 2 receive antennas, uncorrelated 
channels, and varying SNR. The number of feedback bits is scaled with the 
transmit power according to Corollary 1 and [6, Eq. (17)]. (Horizontal axis: 
average SNR [dB].) 

Fig. 11. Comparison of single-user transmission, BD, and different forms 
of ZFC under quantized CSI feedback. The scenario is the same as in [6, 
Fig. 6], where the superior single-user strategy was not included. (Curves: 
Single-User Transmission, BD, ZFC (QBC+MMSE), ZFC (QBC); feedback from 
3 users, 10 bits/user; horizontal axis: average SNR [dB].) 

This observation stands in contrast to the numerical results 
in [6], where BD clearly beats ZFC under quantized CSI. 
To explain the difference, we repeat the simulation in [6, 
Fig. 6] with N = 6 transmit antennas and M = 2 receive 
antennas. In this simulation, the RVQ codebooks contain 10 
bits/user under BD and 5 bits/user under ZFC. 
BD uses M times more feedback bits per user than ZFC. 
The achievable sum rate is shown in Fig. 11 for the quantized BD approach 
in [6] and the ZFC-QBC approach in [7]. We have also 
included: 1) an improved version of ZFC-QBC where the 
MMSE receive combiner is applied during transmission; and 
2) single-user SVD-based transmission to a randomly selected 
user. Our simulation confirms that BD is better than ZFC 
in this scenario, but the difference becomes much smaller 
when the MMSE combiner is applied. However, none of these 
strategies should be used in this scenario since single-user 
transmission is vastly superior. The explanation is that the 
number of feedback bits is fixed at a number that only exceeds 
the feedback scaling law in (26) and [6, Eq. (17)] at low SNR 
(cf. [6, Fig. 2]), while the strict interference mitigation in BD 
and ZFC is only practically meaningful at high SNR. The 
observation in [6] is thus misleading. 

13 An approach to emulate RVQ for very large random codebooks was 
proposed in [6], but this does not change the fact that the quantization 
complexity becomes infeasible much faster under BD than under ZFC. 

Conclusions from the mathematical and numerical analysis 
are summarized in the next section. 



VI. Conclusion 

This paper analyzed how to divide data streams among 
users in a downlink system with many multi-antenna users; 
should few users be allocated many streams, or many users be 
allocated few streams? New and generalized analytic results 
were obtained to study this tradeoff under spatial correlation, 
user selection, heterogeneous user channel conditions, and 
practical CSI acquisition. 

The main conclusion is that sending one stream per selected 
user and exploiting receive combining is the best choice under 
realistic conditions. This is good news as it reduces the 
hardware requirements at the users, compared with multi-stream 
multiplexing. The result is explained by a stronger resilience 
towards spatial correlation (Theorem 1) and larger benefit 
from user selection (Theorem 2). To arrive at alternative 
conclusions, one has to consider a scenario with heterogeneous 
user conditions with either perfect CSI (unrealistic) or where 
CSI is only acquired for the strongest users (destroys coverage 
and fairness). It should however be noted that if only very 
inaccurate CSI can be acquired, then inter-user interference 
will limit performance, thus making single-user transmission 
advantageous. 



Appendix A 
Collection of Lemmas 

This appendix contains two lemmas that are essential for 
proving the theorems of this paper. The first result shows how 
spatial correlation at the receiver affects the channel directions. 

Lemma 2. Let $\mathbf{A} \succ \mathbf{0}_M$ be any Hermitian positive-definite 
matrix and let $\mathbf{H} \in \mathbb{C}^{M \times N}$ be an arbitrary matrix. Then, 
span(H) = span(AH), where span(·) denotes the row space. 

Proof: Let $\mathbf{A} = \mathbf{U}_A \boldsymbol{\Lambda}_A \mathbf{U}_A^H$ be an eigen decomposition 
of A. The lemma follows by observing that $\mathbf{U}_A$ only rotates 
the basis vectors of the row space and $\boldsymbol{\Lambda}_A$ scales the rows 
without affecting their span. ■ 
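
A quick numerical sanity check of Lemma 2 compares the orthogonal projectors onto the two row spaces. This is only an illustrative sketch with hypothetical function names; it draws a random Hermitian positive-definite matrix and a random channel matrix and is not part of the proof.

```python
import numpy as np

def row_space_projector(X, tol=1e-10):
    """Orthogonal projector onto the row space of X, computed via the SVD."""
    _, s, Vh = np.linalg.svd(X, full_matrices=False)
    V = Vh[s > tol * s.max()].conj().T     # keep directions with nonzero gain
    return V @ V.conj().T

def same_row_space(H, A, atol=1e-8):
    """True if H and A @ H have numerically identical row spaces."""
    return bool(np.allclose(row_space_projector(H),
                            row_space_projector(A @ H), atol=atol))

rng = np.random.default_rng(0)
M, N = 4, 8
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
G = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
A = G @ G.conj().T + np.eye(M)             # Hermitian positive definite
lemma2_holds = same_row_space(H, A)
```

Two subspaces coincide exactly when their orthogonal projectors are equal, so the comparison does not depend on which basis the SVD happens to return.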
The second result generalizes the bounding of performance 
loss under imperfect CSI in [6], [7]. 



Lemma 3. Let be isotropically distributed on the 

Grassmannian manifold GN,d k and independent of then 

P 



E{log 2 det(l dfc 



N 



CfH fc W fc W; 



-E < 



det(l dfc + ^ECf H fc W,Wf Hf C k ) > 

log 2 



det (ld fc + ^E Cf H fc W< Wf Hf C fc 

tj^k 

P 



< log 2 det(l dfe + -^E{CfH,W,WfHf C h 



t^k 



(37) 



Proof: This lemma follows from two inequalities. First, 
E{ log 2 det (l dk + ^Cf H fe W fc Wf Hf C fe ) } 
- E{ log 2 det (l dk + ^ £ H fc W/W? Hf C fc ) } < 

t 

(38) 

since CfHfeWfeWfHfCfe and CfH fe W fc WfHfC fc 
have the same distribution, and the second term contains 
additional positive semi-definite matrices. Second, applying 
Jensen's inequality on the concave function log 2 det(-) gives 

E{ log 2 det (l dk + ^J2 Cf H fc W,Wf Hf C fe ) } 



< 



log 2 det (l dk + -J2 E W H fc W/W? Hf C, 



The lemma follows from combining ( [38] ) and ( [39] ). 



(39) 



Appendix B 
Proof of Theorem 1 

Using (10), the asymptotic difference can be expressed as 
/ n fceg B D det(H fc WB D wf^Hf) \ 

p ~ g2 \ rw icf Hfc wi-| 2 )■ (40) 

Assume that c& G C Mxl is selected suboptimally as the 
dominating eigenvector of Rr^. This is a lower bound 
because every judicious selection based on perfect CSI (e.g., 
MRC) achieves better performance, but it is convenient since 
cj^Hfc = ^/A^Mh^ with hfe ~ £A/"(0, Ijv). The asymptotic 
difference can thus be upper bounded as 

n,^Bodet(H fc WfWf^Hf)\ 



/3<log 2 



n^z FC |hfwfC|2 



(41) 



kes BD 



log 2 det(Rfl ) fe) 



kes ZFC 



log 2 (A fe) M). 



We have ^bd-zfc = and the expectation of the first 

term of (41) can be rewritten as the first term in (11) by 
applying [22, Theorem 3]. The cited theorem was stated for 
uncorrelated channels, but can be applied in our scenario 
since $\mathbf{W}_k^{\mathrm{BD}}$ is not affected by the receive-side correlation 
matrices $\mathbf{R}_{R,\ell}$ $\forall \ell \in \mathcal{S}^{\mathrm{ZFC}}$; see Lemma 2 in Appendix A, which 
shows that receive correlation will not affect the row space 
of channels (i.e., the spaces where inter-user interference is 
canceled). Finally, observe that the last two terms of (41) are 
deterministic and correspond to the second term in (11). 

Appendix C 
Proof of Theorem 2 

We begin with BD and assume that there are K candidates 
to become the new user k, $\mathcal{K} = \{1, \ldots, K\}$, while the other 
users in $\mathcal{S}^{\mathrm{BD}}$ are fixed. Since $|\mathcal{S}^{\mathrm{BD}}| = N/M$, all available degrees 
of freedom are consumed by the interference cancelation. The 
precoding matrix $\mathbf{W}_k^{\mathrm{BD}}$ is therefore completely determined by 
the common null space of the co-users' channels and fixed in 
this proof. 

Minimizing (14) corresponds to finding the user $i \in \mathcal{K}$ with 
the row space of $\mathbf{H}_i$ most compatible with $\mathbf{W}_k^{\mathrm{BD}}$. For a user 
candidate $i \in \mathcal{K}$, we can lower bound (14) as 

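$$
\begin{aligned}
-\mathbb{E}\{\log_2\det(\mathbf{B}_i \mathbf{W}_k^{\mathrm{BD}} \mathbf{W}_k^{\mathrm{BD},H} \mathbf{B}_i^H)\}
&= -M\,\mathbb{E}\{\log_2(\det(\mathbf{B}_i \mathbf{W}_k^{\mathrm{BD}} \mathbf{W}_k^{\mathrm{BD},H} \mathbf{B}_i^H)^{1/M})\} \\
&\geq -M\,\mathbb{E}\Big\{\log_2\Big(\frac{\mathrm{tr}\{\mathbf{B}_i \mathbf{W}_k^{\mathrm{BD}} \mathbf{W}_k^{\mathrm{BD},H} \mathbf{B}_i^H\}}{M}\Big)\Big\} \\
&\geq -M\log_2\Big(\frac{\mathbb{E}\{\mathrm{tr}\{\mathbf{B}_i \mathbf{W}_k^{\mathrm{BD}} \mathbf{W}_k^{\mathrm{BD},H} \mathbf{B}_i^H\}\}}{M}\Big) \\
&= -M\log_2\Big(1 - \frac{\mathbb{E}\{M - \mathrm{tr}\{\mathbf{B}_i \mathbf{W}_k^{\mathrm{BD}} \mathbf{W}_k^{\mathrm{BD},H} \mathbf{B}_i^H\}\}}{M}\Big).
\end{aligned}
\tag{42}
$$

The first inequality is the classic inequality between 
arithmetic and geometric means, while the second inequality 
follows from applying Jensen's inequality on the convex 
function $-\log_2\det(\cdot)$. The final expression in (42) contains 
$M - \mathrm{tr}\{\mathbf{B}_i \mathbf{W}_k^{\mathrm{BD}} \mathbf{W}_k^{\mathrm{BD},H} \mathbf{B}_i^H\}$, which is the squared chordal 
distance between $\mathbf{B}_i$ and $\mathbf{W}_k^{\mathrm{BD}}$. 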
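
The per-realization step of this chain, the inequality between arithmetic and geometric means applied to the eigenvalues, can be checked numerically. The sketch below is an illustration with hypothetical function names; it draws random semi-unitary matrices in place of the channel basis and the precoder, and it does not verify the Jensen step, which only holds in expectation.

```python
import numpy as np

def random_semi_unitary(N, M, rng):
    """Random N x M matrix with orthonormal columns."""
    G = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
    Q, _ = np.linalg.qr(G)
    return Q

def amgm_terms(B, W):
    """Return (-log2 det(X), -M log2(tr(X)/M), M - tr(X)) for X = B W W^H B^H.

    B is M x N with orthonormal rows; W is N x M with orthonormal columns.
    """
    X = B @ W @ W.conj().T @ B.conj().T
    M = X.shape[0]
    tr = np.trace(X).real
    lhs = -np.log2(np.linalg.det(X).real)   # det of a PSD matrix is real
    rhs = -M * np.log2(tr / M)
    return float(lhs), float(rhs), float(M - tr)

rng = np.random.default_rng(0)
B = random_semi_unitary(8, 4, rng).conj().T
W = random_semi_unitary(8, 4, rng)
lhs, rhs, squared_chordal = amgm_terms(B, W)
```

For every realization, `lhs` is at least `rhs`, and `rhs` equals $-M\log_2(1 - \delta^2/M)$ with $\delta^2$ the squared chordal distance returned as the third value, which takes values between 0 and M.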

Since the matrices $\mathbf{B}_i$, for $i \in \mathcal{K}$, are independent and 
isotropically distributed on the Grassmannian manifold $\mathcal{G}_{N,M}$ 
irrespective of the receive-side correlation (see Lemma 2), 
we can bring in results from [29] on quantization of 
Grassmannian manifolds using K random codewords. From [29, 
Theorem 4], we have the following lower bound on the average 
squared chordal distance (for sufficiently large K): 



mm E{M-tr{B,Wf Wf ' H Bf }} 



> 



M(N - M) 
M(N — M) + 



" M(N-M) 



■^ C N,M,M,2 11 



(43) 



where $c_{N,M,M,2}$ is a positive constant defined in [29, Eq. (8)]. 
Plugging (43) into (42) yields the lower bound for BD in the 
theorem. 

A similar approach can be taken under ZFC (by setting 
M = 1 in the derivation), but the M receive antennas provide 
degrees of freedom to select the effective channel as the vector 
in the row space of $\mathbf{H}_i$ that minimizes the chordal distance to 
$\mathbf{w}_k^{\mathrm{ZFC}}$. This is done by the QBC approach in [7], which was 
derived for uncorrelated channels but can be applied under 
receive correlation due to Lemma 2. We apply [7, Lemma 1], 
which says that the minimal chordal distance is the minimum 
of K independent $\beta(N - M, M)$-distributed random variables. 
This quantity can be lower bounded by taking the minimum 
of K independent $\beta(N - M, 1)$ variables and further lower 
bounded by the quantization bound in [29, Theorem 4]: 



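$$
\min_{k \in \mathcal{K}} \mathbb{E}\bigg\{1-\frac{|\mathbf{u}^H\mathbf{w}_k^{\text{ZFC}}|^2}{\|\mathbf{u}\|_2^2}\bigg\}
\geq \frac{N-M}{N-M+1}\big(c_{N-M+1,1,1,2}\,K\big)^{-\frac{1}{N-M}}
\tag{44}
$$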



where $c_{N-M+1,1,1,2}$ is a positive constant defined in [29,
Eq. (8)]. Plugging (44) into (42) (with $M = 1$) yields the
lower bound for ZFC in the theorem.
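The behavior of the minimum of $K$ independent beta-distributed variables, which the ZFC argument relies on, can be illustrated by simulation. This is a rough sketch, assuming that $\beta(N-M, M)$ maps to `random.betavariate(alpha=N-M, beta=M)`; the parameter order, the function name `mean_min_beta`, and the chosen trial count are assumptions made for illustration.

```python
import random

def mean_min_beta(a, b, K, trials=2000, seed=1):
    """Monte Carlo estimate of E{min of K i.i.d. Beta(a, b) variables}."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Smallest distance among the K candidates in this trial.
        total += min(rng.betavariate(a, b) for _ in range(K))
    return total / trials
```

For example, `mean_min_beta(2, 2, K)` corresponds to $N = 4$ and $M = 2$ under the stated assumption, and the estimate shrinks as `K` grows, in line with the multi-user diversity effect described above.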

Appendix D 
Proof of Theorem 3

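Using Lemma 2, the row space of the correlated channel
$\mathbf{H}_k = \mathbf{R}_{R,k}^{1/2}\widetilde{\mathbf{H}}_k$ is the same as for the uncorrelated channel
$\widetilde{\mathbf{H}}_k$. Consequently, $\mathbf{W}_k^{\text{BD}}$ will be isotropically distributed on
the Grassmannian manifold $\mathcal{G}_{N,M}$, just as for uncorrelated
channels in [6, Theorem 1]. The performance loss can therefore
be bounded using Lemma 3 and it only remains to
characterize $\mathbb{E}\{\mathbf{H}_k\mathbf{W}_\ell^{\text{BD}}\mathbf{W}_\ell^{\text{BD},H}\mathbf{H}_k^H\}$ for $\ell \neq k$. Observe that

$$
\mathbb{E}\{\mathbf{H}_k\mathbf{W}_\ell^{\text{BD}}\mathbf{W}_\ell^{\text{BD},H}\mathbf{H}_k^H\}
= \mathbf{R}_{R,k}^{1/2}\,\mathbb{E}\{\mathbf{L}_k\mathbf{Q}_k\mathbf{W}_\ell^{\text{BD}}\mathbf{W}_\ell^{\text{BD},H}\mathbf{Q}_k^H\mathbf{L}_k^H\}\,\mathbf{R}_{R,k}^{1/2}
\tag{45}
$$

using that $\mathbf{H}_k = \mathbf{R}_{R,k}^{1/2}\mathbf{L}_k\mathbf{Q}_k$, where $\mathbf{L}_k \in \mathbb{C}^{M \times M}$
is the lower triangular matrix and $\mathbf{Q}_k \in \mathbb{C}^{M \times N}$ is the semi-unitary
matrix in an LQ decomposition of $\widetilde{\mathbf{H}}_k$. Observe that
$\mathbf{L}_k$ and $\mathbf{Q}_k$ are independent, thus we can calculate their
expectations sequentially as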

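$$
\mathbb{E}\{\mathbf{L}_k\mathbf{Q}_k\mathbf{W}_\ell^{\text{BD}}\mathbf{W}_\ell^{\text{BD},H}\mathbf{Q}_k^H\mathbf{L}_k^H\}
= \frac{D^{\text{BD}}}{N-M}\,\mathbb{E}\{\mathbf{L}_k\mathbf{I}_M\mathbf{L}_k^H\}
= \frac{N D^{\text{BD}}}{N-M}\,\mathbf{I}_M.
\tag{46}
$$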



The first equality follows from [6, Eq. (43)-(45)], while the
second follows from $\mathbb{E}\{\mathbf{L}_k\mathbf{L}_k^H\} = N\mathbf{I}_M$ (since $\mathbb{E}\{\widetilde{\mathbf{H}}_k\widetilde{\mathbf{H}}_k^H\} =
N\mathbf{I}_M$). Plugging (46) into Lemma 3 yields

A BD <log 2 detfl M + Sf^-ll#^R^ (47) 



N-M 



from which (20) follows directly. The approximate expression
for $D^{\text{BD}}$ is given in Eq. (26)].
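The identity $\mathbb{E}\{\widetilde{\mathbf{H}}_k\widetilde{\mathbf{H}}_k^H\} = N\mathbf{I}_M$ used in this proof can be checked by simulation; since $\mathbf{Q}_k\mathbf{Q}_k^H = \mathbf{I}_M$ for the semi-unitary factor, the same average applies to $\mathbf{L}_k\mathbf{L}_k^H$. The sketch below assumes i.i.d. $\mathcal{CN}(0,1)$ entries for the uncorrelated channel; the function name `average_gram` and the trial count are hypothetical choices.

```python
import math
import random

def average_gram(M, N, trials=4000, seed=2):
    """Monte Carlo estimate of E{H H^H} for an M x N matrix H with
    i.i.d. CN(0, 1) entries (unit variance, split evenly between the
    real and imaginary parts)."""
    rng = random.Random(seed)
    s = math.sqrt(0.5)
    acc = [[0j] * M for _ in range(M)]
    for _ in range(trials):
        H = [[complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for _ in range(N)]
             for _ in range(M)]
        for i in range(M):
            for j in range(M):
                # Entry (i, j) of H H^H is the inner product of rows i and j.
                acc[i][j] += sum(H[i][n] * H[j][n].conjugate() for n in range(N))
    return [[x / trials for x in row] for row in acc]
```

With $M = 2$ and $N = 4$, the diagonal entries of the estimate concentrate around $N$ while the off-diagonal entries concentrate around zero.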

Appendix E 
Proof of Theorem 4

This proof follows along the lines of [7, Theorem 1],
with the difference that 1) we have spatial correlation at the
receiver; and 2) we use QBC also under perfect CSI. Using
Lemma 2, we observe that the row space of the correlated
channel $\mathbf{H}_k = \mathbf{R}_{R,k}^{1/2}\widetilde{\mathbf{H}}_k$ is the same as for the uncorrelated
channel $\widetilde{\mathbf{H}}_k$. Since the gain of the effective channel is ignored
in (22), the error-minimizing codeword is the same as for
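uncorrelated channels and we can apply [7, Lemma 2] to conclude
that the direction of the effective channel $\mathbf{h}_k = \mathbf{H}_k^H\mathbf{c}_k^{\text{QBC}}$
is isotropically distributed. The beamforming vector $\mathbf{w}_\ell^{\text{ZFC}}$ is
independent of $\mathbf{h}_k$ and also isotropic, thus the performance
loss can be bounded using Lemma 3. It only remains to
characterize $\mathbb{E}\{|\mathbf{h}_k^H\mathbf{w}_\ell^{\text{ZFC}}|^2\} = \mathbb{E}\{\|\mathbf{h}_k\|_2^2\}\,\mathbb{E}\big\{\big|\frac{\mathbf{h}_k^H}{\|\mathbf{h}_k\|_2}\mathbf{w}_\ell^{\text{ZFC}}\big|^2\big\}$ for
$\ell \neq k$. The second factor is given by [7, Lemma 2]
and [7, Eq. (17)], while the average norm $\mathbb{E}\{\|\mathbf{h}_k\|_2^2\}$ of the
effective channel is nontrivial. To enable reuse of results from
[7], let $\mathbf{c}_k^{\text{U-QBC}}$ be the QBC for the uncorrelated channel
$\widetilde{\mathbf{H}}_k$ and observe that

$$
\mathbf{c}_k^{\text{QBC}} = \frac{\mathbf{R}_{R,k}^{-1/2}\mathbf{c}_k^{\text{U-QBC}}}{\|\mathbf{R}_{R,k}^{-1/2}\mathbf{c}_k^{\text{U-QBC}}\|_2}.
\tag{48}
$$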



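We can therefore express the effective channel as

$$
\mathbf{h}_k = \mathbf{H}_k^H\mathbf{c}_k^{\text{QBC}}
= \widetilde{\mathbf{H}}_k^H\mathbf{R}_{R,k}^{1/2}\frac{\mathbf{R}_{R,k}^{-1/2}\mathbf{c}_k^{\text{U-QBC}}}{\|\mathbf{R}_{R,k}^{-1/2}\mathbf{c}_k^{\text{U-QBC}}\|_2}
= \frac{\widetilde{\mathbf{H}}_k^H\mathbf{c}_k^{\text{U-QBC}}}{\|\mathbf{R}_{R,k}^{-1/2}\mathbf{c}_k^{\text{U-QBC}}\|_2}
\tag{49}
$$

and its squared norm will be

$$
\|\mathbf{h}_k\|_2^2 = \|\widetilde{\mathbf{H}}_k^H\mathbf{c}_k^{\text{U-QBC}}\|_2^2\,\frac{1}{\mathbf{c}_k^{\text{U-QBC},H}\mathbf{R}_{R,k}^{-1}\mathbf{c}_k^{\text{U-QBC}}}.
\tag{50}
$$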



The first factor is the same as under uncorrelated fading and
has $\mathbb{E}\{\|\widetilde{\mathbf{H}}_k^H\mathbf{c}_k^{\text{U-QBC}}\|_2^2\} = N - M + 1$ (see [7, Lemma 4]),
while the second factor depends on $\mathbf{R}_{R,k}$. Since both the
quantization codebook and $\widetilde{\mathbf{H}}_k$ are isotropically distributed,
$\mathbf{c}_k^{\text{U-QBC}}$ is also isotropic and the two terms in (50) are independent.
To characterize the second term, observe that $\mathbf{c}_k^{\text{U-QBC}}$
can be viewed as a normalized uncorrelated circular-symmetric
complex Gaussian vector. By using that the eigenvectors
of $\mathbf{R}_{R,k}$ are not affecting the distribution and that squared
magnitudes of $\mathcal{CN}(0, 1)$-variables are exponentially distributed
[39], we conclude that the second term of (50) has the same
distribution as

$$
\frac{\sum_{i=1}^{M}\xi_i}{\sum_{i=1}^{M}\lambda_{k,i}^{-1}\xi_i}
\tag{51}
$$



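for independent exponentially distributed $\xi_i \sim \mathrm{Exp}(1)$. For
any $a$ such that $\lambda_{k,m} < a < \lambda_{k,m+1}$, we can write the CDF
as

$$
\Pr\bigg(\frac{\sum_{i=1}^{M}\xi_i}{\sum_{i=1}^{M}\lambda_{k,i}^{-1}\xi_i} < a\bigg)
= \Pr\bigg(\sum_{i=1}^{m}\big(\lambda_{k,i}^{-1}-a^{-1}\big)\xi_i - \sum_{i=m+1}^{M}\big(a^{-1}-\lambda_{k,i}^{-1}\big)\xi_i > 0\bigg).
\tag{52}
$$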




This is the difference of two sums of exponentially distributed
variables (with distinct positive variances). The PDF of each
sum is characterized by [39, Theorem 4] and by calculating
their convolution and integrating over all positive values, we
obtain the CDF

$$
\Pr\bigg(\frac{\sum_{i=1}^{M}\xi_i}{\sum_{i=1}^{M}\lambda_{k,i}^{-1}\xi_i} < a\bigg)
= \sum_{n=1}^{m}\sum_{\ell=m+1}^{M}
\frac{(\mu_n-a^{-1})^{m}\,(a^{-1}-\mu_\ell)^{M-m-1}}
{(\mu_n-\mu_\ell)\prod_{i=1,\,i\neq n}^{m}(\mu_n-\mu_i)\prod_{j=m+1,\,j\neq \ell}^{M}(\mu_j-\mu_\ell)}
\tag{53}
$$

using the simplifying notation $\mu_n = \lambda_{k,n}^{-1}$. The corresponding
mean value is obtained from the CDF by taking the
derivative and summing the mean values over each $a$-interval.
By multiplying the mean value expression with $N - M + 1$
(i.e., the contribution of the first part in (50)), we obtain the
expression for $G_k$.
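The closed-form CDF of the ratio of exponential sums can be cross-checked against simulation. The sketch below is an illustration under stated assumptions: the eigenvalues are distinct and positive, the closed form is taken as a double sum over $n \le m$ and $\ell > m$ with $\mu_n = 1/\lambda_{k,n}$, and the function names `cdf_closed_form` and `empirical_cdf` are hypothetical.

```python
import math
import random

def cdf_closed_form(a, lam):
    """Closed-form Pr(sum(xi) / sum(xi / lam_i) < a) for distinct positive
    eigenvalues lam, with a strictly between two consecutive eigenvalues."""
    lam = sorted(lam)
    M = len(lam)
    m = sum(1 for x in lam if x < a)  # number of eigenvalues below a
    if m == 0:
        return 0.0  # the ratio never falls below the smallest eigenvalue
    if m == M:
        return 1.0  # the ratio never exceeds the largest eigenvalue
    mu = [1.0 / x for x in lam]
    inv_a = 1.0 / a
    total = 0.0
    for n in range(m):
        for l in range(m, M):
            num = (mu[n] - inv_a) ** m * (inv_a - mu[l]) ** (M - m - 1)
            den = mu[n] - mu[l]
            for i in range(m):
                if i != n:
                    den *= mu[n] - mu[i]
            for j in range(m, M):
                if j != l:
                    den *= mu[j] - mu[l]
            total += num / den
    return total

def empirical_cdf(a, lam, trials=50000, seed=3):
    """Monte Carlo estimate of the same probability with xi_i ~ Exp(1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xi = [rng.expovariate(1.0) for _ in lam]
        ratio = sum(xi) / sum(x / l for x, l in zip(xi, lam))
        if ratio < a:
            hits += 1
    return hits / trials
```

As a simple sanity check, for two eigenvalues $\{1, 4\}$ and $a = 2$ the event reduces to $0.5\,\xi_1 > 0.25\,\xi_2$, whose probability is $2/3$, and the closed form returns the same value.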



Appendix F 
Proof of Theorem 5

The proof follows along the lines of Theorem 3, with the
difference that we consider CSI estimation errors instead of
quantization errors. First, observe that both $\mathbf{W}_k^{\text{BD}}$ and $\hat{\mathbf{W}}_k^{\text{BD}}$ are
isotropically distributed on the Grassmannian manifold $\mathcal{G}_{N,M}$
(since receive-side correlation is not affecting the row space
of $\mathbf{H}_k$ and $\hat{\mathbf{H}}_k$; see Lemma 2). The performance loss can
therefore be bounded using Lemma 3 and it only remains to
characterize $\mathbb{E}\{\mathbf{H}_k\hat{\mathbf{W}}_\ell^{\text{BD}}\hat{\mathbf{W}}_\ell^{\text{BD},H}\mathbf{H}_k^H\}$ for $\ell \neq k$. From (28)
we have



Hz, — 



(54) 



where the second term is the estimation error, H E , 



TfT fc 



1 , and Efe has CA/"(0, 1) -entries. By using 



that $\hat{\mathbf{H}}_k\hat{\mathbf{W}}_\ell^{\text{BD}} = \mathbf{0}$ for $\ell \neq k$, we obtain



E{H,WrW^'"Hn = R| >fc E{E fc W^Wr^Ej?}R| >fc 

(55) 



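where $\mathbb{E}\{\mathbf{E}_k\hat{\mathbf{W}}_\ell^{\text{BD}}\hat{\mathbf{W}}_\ell^{\text{BD},H}\mathbf{E}_k^H\} = M\mathbf{I}_M$ since $\mathbf{E}_k$ is
complex Gaussian and independent of $\hat{\mathbf{W}}_\ell^{\text{BD}}$. Therefore,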
E{H fc wpwf' H Hf } = M(R-U T " T 



Lj^-1 



Appendix G 
Proof of Lemma 4

Observe that $\mathbf{H} = \mathbf{R}_R^{1/2}\widetilde{\mathbf{H}}$ has the same distribution
as $\mathbf{H}\mathbf{U}$ for any unitary matrix $\mathbf{U}$. Thus, we can rotate $\mathbf{h}$
arbitrarily without changing the statistics, meaning that
the direction of $\mathbf{h}$ must be isotropically distributed. Next, note that $\|\mathbf{h}\|_2^2 =
\|\mathbf{c}^H\mathbf{H}\mathbf{U}\|_2^2 = \|\mathbf{c}^H\mathbf{H}\|_2^2$, thus unitary rotations will not affect
the effective channel gain, meaning that the direction and
the channel gain are statistically independent. $\|\mathbf{h}\|_2^2$ is the
dominating eigenvalue of the correlated complex Wishart
matrix $\mathbf{H}\mathbf{H}^H \in \mathcal{W}_M(N, \mathbf{R}_R)$. The mean value expression
in (32) can be obtained directly from [40, Theorem 3] or by
using the moment generating function in [41] (which gives an
equivalent expression that looks slightly different).

Appendix H
Proof of Theorem 6

This theorem is proved in the same way as Theorem 5. The
only notable difference is that we use the effective channel
$\mathbf{h}_k$, which has a single effective receive antenna, instead of the
original channel $\mathbf{H}_k$. The effective channel is zero-mean and
has an average channel gain $\mathbb{E}\{\|\mathbf{h}_k\|_2^2\}$ given by (32). Thus,
the effective channel and its channel estimate are related as

h " = £f +(wwtm + ^)" 1/2g " where ^ ~ c -^(°> 